Using the available CSV files and data scraped from besoccer.com, I have compiled a dataset of 161,112 football matches across 14 leagues, with 55 columns of relevant information about each match.

In [2]:
import pandas as pd
data = pd.read_csv('cleaned_dataset.csv')
data.columns

Index(['Result', 'Season', 'Round', 'Teams_in_League',
       'Home_Team_Goals_For_This_Far', 'Home_Team_Goals_Against_This_Far',
       'Away_Team_Goals_For_This_Far', 'Away_Team_Goals_Against_This_Far',
       'Home_Team_Points', 'Away_Team_Points', 'Home_Team_Losing_Streak',
       'Away_Team_Losing_Streak', 'Home_Team_Winning_Streak',
       'Away_Team_Winning_Streak', 'Home_Team_Unbeaten_Streak',
       'Away_Team_Unbeaten_Streak', 'Elo_home', 'Elo_away',
       'Home_Wins_This_Far', 'Home_Draws_This_Far', 'Home_Losses_This_Far',
       'Away_Wins_This_Far', 'Away_Draws_This_Far', 'Away_Losses_This_Far',
       'Home_Wins_This_Far_at_Home', 'Home_Draws_This_Far_at_Home',
       'Home_Losses_This_Far_at_Home', 'Home_Wins_This_Far_Away',
       'Home_Draws_This_Far_Away', 'Home_Losses_This_Far_Away',
       'Away_Wins_This_Far_at_Home', 'Away_Draws_This_Far_at_Home',
       'Away_Losses_This_Far_at_Home', 'Away_Wins_This_Far_Away',
       'Away_Draws_This_Far_Away', 'Away_Losses_Thi

By training a succession of models and tuning their hyper-parameters, it is possible to find a model that uses this information to predict future results accurately. Feature selection was used to remove irrelevant columns, which shrinks the data and lets the models train more quickly.

A correlation metric gave the following relevant columns:

In [4]:
svm_cols = ['Season', 'Teams_in_League', 'Home_Team_Goals_For_This_Far',
            'Home_Team_Goals_Against_This_Far', 'Away_Team_Goals_For_This_Far',
            'Away_Team_Goals_Against_This_Far', 'Home_Team_Points',
            'Away_Team_Points', 'Away_Team_Winning_Streak',
            'Home_Team_Unbeaten_Streak', 'Away_Team_Unbeaten_Streak', 'Elo_home',
            'Elo_away', 'Home_Wins_This_Far', 'Home_Draws_This_Far',
            'Home_Losses_This_Far', 'Away_Draws_This_Far',
            'Home_Wins_This_Far_at_Home', 'Home_Draws_This_Far_at_Home',
            'Home_Losses_This_Far_at_Home', 'Home_Draws_This_Far_Away',
            'Away_Wins_This_Far_at_Home', 'Away_Draws_This_Far_at_Home',
            'Away_Losses_This_Far_at_Home', 'Away_Wins_This_Far_Away',
            'Away_Draws_This_Far_Away', 'Capacity', 'Home_Yellow',
            'Away_Team_Yellows_This_Far', 'Away_Red', 'Home_Points_Per_Game',
            'Home_Goals_Per_Game', 'Home_Goals_Against_Per_Game',
            'Away_Points_Per_Game', 'Away_Goals_Per_Game',
            'Away_Goals_Against_Per_Game', 'Away_Cards_Per_Game', 'Pitch_Match',
            'League']

Other methods, such as SelectKBest with a chi-squared score function and sorting RandomForest feature importances, produced similar lists. The correlation-based list was chosen purely because it produced the best results on the test set.
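As a rough illustration of how the three rankings mentioned here can be compared, the sketch below runs them on a small synthetic table. The column names (`Elo_diff`, `Points_diff`, `Noise_1`, `Noise_2`), the generated data and the choice of `k = 2` are assumptions made for the example, not the settings used on the real dataset. Note that `chi2` only accepts non-negative features, so the sketch rescales the data with `MinMaxScaler` first.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the match table (hypothetical columns and values).
rng = np.random.default_rng(0)
n = 600
demo = pd.DataFrame({
    'Elo_diff': rng.normal(0, 100, n),
    'Points_diff': rng.normal(0, 10, n),
    'Noise_1': rng.normal(0, 1, n),
    'Noise_2': rng.normal(0, 1, n),
})
# The label depends only on the first two columns.
signal = demo['Elo_diff'] / 100 + demo['Points_diff'] / 10
demo_y = (signal + rng.normal(0, 0.5, n) > 0).astype(int)

k = 2  # number of columns to keep (illustrative)

# 1) Absolute Pearson correlation of each column with the label.
corr_rank = demo.corrwith(demo_y).abs().sort_values(ascending=False)
corr_cols = list(corr_rank.index[:k])

# 2) SelectKBest with chi2 (needs non-negative inputs, hence MinMaxScaler).
X_pos = MinMaxScaler().fit_transform(demo)
selector = SelectKBest(chi2, k=k).fit(X_pos, demo_y)
chi2_cols = list(demo.columns[selector.get_support()])

# 3) Random forest feature importances, sorted from most to least important.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(demo, demo_y)
rf_rank = pd.Series(forest.feature_importances_, index=demo.columns)
rf_cols = list(rf_rank.sort_values(ascending=False).index[:k])

print(corr_cols, chi2_cols, rf_cols)
```

Each candidate list can then be used to subset the features, and the list that scores best on held-out data kept.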

Grid search (specifically scikit-learn's GridSearchCV class) was used to find the best model, after splitting the dataset into train and test sets and scaling them with StandardScaler.
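A minimal sketch of this kind of search is shown below. It is not the exact search that was run: the synthetic data from `make_classification`, the tiny parameter grid and the use of a `Pipeline` (so that the scaler is refit inside each cross-validation fold) are all assumptions made to keep the example small and self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the match features (assumption).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5,
    n_classes=3, random_state=1)

# Hold out a test set before any scaling or tuning.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=1)

# Scaling lives inside the pipeline, so it is fit on training folds only.
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('knn', KNeighborsClassifier()),
])

# Illustrative grid; a real search would cover far more values.
param_grid = {'knn__n_neighbors': [5, 25, 51]}

search = GridSearchCV(pipe, param_grid, cv=3, scoring='accuracy')
search.fit(X_tr, y_tr)

print(search.best_params_)
print(search.score(X_te, y_te))  # accuracy of the best pipeline on the test set
```

The same pattern applies to any of the estimators by swapping the final pipeline step and its grid.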

In [None]:
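from sklearn.linear_model import (LinearRegression, Lasso, RidgeClassifier,
                                  SGDClassifier, SGDRegressor)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier, MLPRegressor
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import (AdaBoostClassifier, AdaBoostRegressor,
                              RandomForestClassifier, RandomForestRegressor,
                              GradientBoostingClassifier,
                              GradientBoostingRegressor)
from xgboost import XGBClassifier, XGBRegressor

models = [LinearRegression(),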
    KNeighborsClassifier(n_neighbors=151),
    MLPClassifier(hidden_layer_sizes=(150, 100, 50), max_iter=1000,
                  activation='tanh', solver='adam', random_state=1,
                  learning_rate='adaptive'),
    MLPRegressor(activation='tanh', alpha=0.1,
       hidden_layer_sizes=(150, 100, 50),
       learning_rate='adaptive', solver='sgd',
       max_iter=1000),
    DecisionTreeClassifier(random_state=1,
    max_features="sqrt",
    max_depth=None),
    DecisionTreeRegressor(criterion='squared_error',
    max_depth=5),
    Lasso(alpha=0.00023),
    AdaBoostClassifier(learning_rate=1.0, n_estimators=10000),
    AdaBoostRegressor(learning_rate=0.01, n_estimators=10000),
    RandomForestClassifier(
        criterion='entropy', max_depth=128,
        max_features='log2', n_estimators=1024),
    RandomForestRegressor(criterion='poisson',
    max_depth=12, max_features='log2',
    n_estimators=256),
    GradientBoostingClassifier(criterion='friedman_mse',
                               learning_rate=0.2, loss='log_loss',
                               max_depth=8, max_features='sqrt',
                               min_samples_leaf=0.1,
                               min_samples_split=0.18,
                               n_estimators=10, subsample=1),
    GradientBoostingRegressor(criterion='friedman_mse',
    learning_rate=0.2, loss='squared_error',
    max_depth=8, max_features='log2',
    min_samples_leaf=0.1,
    min_samples_split=0.18,
    n_estimators=10, subsample=1),
    XGBClassifier(learning_rate=0.01, max_depth=6, n_estimators=324),
    XGBRegressor(learning_rate=0.05, max_depth=4, n_estimators=220),
    SGDClassifier(alpha=0.01, loss='log_loss', penalty='none'),
    SGDRegressor(alpha=0.01, loss='squared_error', penalty='none'),
    RidgeClassifier()
]

The hyper-parameters above scored best for each model after roughly a day of training per model. By score, the best models were RandomForestClassifier and XGBClassifier.
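For illustration, a comparison of this kind can be sketched as a loop that fits each candidate and records its test score. The three classifiers, their settings and the synthetic data below are assumptions chosen so the example runs quickly; they are not the tuned models listed above. The sketch is limited to classifiers because `score` returns accuracy for scikit-learn classifiers but R² for regressors, so the two are not directly comparable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic three-class data standing in for the match features (assumption).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5,
    n_classes=3, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=1)

# Small illustrative candidate list (not the tuned models from the write-up).
candidates = [
    KNeighborsClassifier(n_neighbors=15),
    RidgeClassifier(),
    RandomForestClassifier(n_estimators=50, random_state=1),
]

# Fit every candidate on the same split and record its test accuracy.
scores = {}
for clf in candidates:
    clf.fit(X_tr, y_tr)
    scores[type(clf).__name__] = clf.score(X_te, y_te)

# Print the candidates from best to worst.
for name, acc in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f'{name}: {acc:.3f}')
```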

In [6]:
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

def scale_train_test(X_train, X_test):
    # Fit the scaler on the training set only, then apply it to both sets
    scaler = StandardScaler()
    scaler.fit(X_train)
    return scaler.transform(X_train), scaler.transform(X_test)

def accuracy(cm):
    # Fraction of correct predictions: diagonal of the confusion matrix over its total
    return cm.trace() / cm.sum()

y = data['Result'].values
X = data.drop(['Result', 'Date_New', 'Link'], axis=1)
# Encode the league name as an integer category code
X['League'] = X['League'].astype('category').cat.codes
# Split before scaling so the test set does not leak into the scaler statistics
X_train, X_test, y_train, y_test = train_test_split(X[svm_cols], y, test_size=0.1)
X_train, X_test = scale_train_test(X_train, X_test)
model = RandomForestClassifier(
        criterion='entropy', max_depth=128,
        max_features='log2', n_estimators=1024)
model = model.fit(X_train, y_train)
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(accuracy(cm))

This gave an accuracy of around 0.53, which is passable. Afterwards, I iteratively tested removing older years from the dataset, as they are unlikely to be reflective of modern football. Supported by this testing, I decided to remove the matches played before the year 2000. I also briefly trialed removing the Eerste Divisie before simply rescraping the data, as it was poorly scraped initially.
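The iterative removal of older years could be sketched along the following lines. This is a hypothetical reconstruction, not the code that was actually run: it assumes `Season` holds an integer year, uses a synthetic table with made-up `Elo_home` and `Elo_away` values, and the helper `accuracy_from_cutoff` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the match table (hypothetical values).
rng = np.random.default_rng(0)
n = 1200
demo = pd.DataFrame({
    'Season': rng.integers(1990, 2022, n),  # integer years 1990..2021 (assumption)
    'Elo_home': rng.normal(1500, 100, n),
    'Elo_away': rng.normal(1500, 100, n),
})
# Binary label: 1 if the (noisy) home rating beats the away rating.
demo['Result'] = (
    demo['Elo_home'] + rng.normal(0, 100, n) > demo['Elo_away']
).astype(int)

def accuracy_from_cutoff(df, first_season):
    """Keep matches from first_season onwards; return (row count, test accuracy)."""
    recent = df[df['Season'] >= first_season]
    X = recent.drop(columns='Result')
    y = recent['Result']
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.1, random_state=0)
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(X_tr, y_tr)
    return len(recent), clf.score(X_te, y_te)

# Try progressively later cutoffs and compare the resulting accuracies.
results = {cutoff: accuracy_from_cutoff(demo, cutoff)
           for cutoff in (1990, 1995, 2000)}
for cutoff, (rows, acc) in results.items():
    print(cutoff, rows, round(acc, 3))
```

Because each cutoff here is scored on a different random test split, a fairer comparison would hold out one fixed set of recent matches and evaluate every cutoff against it.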