# Probability Estimation of Home and Away Teams for English Football Matches

Football betting involves large sums of money: each year bookmakers make hundreds of thousands, if not millions, from the public betting on football games. Bookmakers use probability to gain an edge on the games being bet on, and this is reflected in the odds offered to the public. If we can estimate the probabilities of the home and away teams winning from the data gathered, we stand a greater chance of gaining an advantage ourselves. Any kind of betting carries a risk of losing, and football teams often go against the run of their data. For example, the team at the top of the Premier League can lose or drop points to the team at the bottom; it doesn't always happen, but it does happen.

For this project the goal is to estimate the win probabilities of the home and away teams in matches from the Premier League, Championship and League One of English football. Only win probabilities are estimated because two teams cancelling each other out for a draw is usually the less likely result; it is more likely that one team underperforms, one team overperforms, or both teams are simply unlucky on the day. As we can't factor these outcomes in, predicting a draw would be rather difficult.

The idea is to create two classification models, one estimating home team win probabilities and one estimating away team win probabilities (where win = class 1 and lose/draw = class 0). For each game, the team with the higher win probability is ranked by that probability and by its difference from the opposing team's lower probability, along with its expected value and bet return. These teams can then be bet on and also used in accumulation calculations for accumulator bets.

Home team expected value formulation:

$$p(W \mid ht) \cdot wa - \bigl(1 - p(W \mid ht)\bigr) \cdot ba$$

Away team expected value formulation:

$$p(W \mid at) \cdot wa - \bigl(1 - p(W \mid at)\bigr) \cdot ba$$

where $W$ = win, $ht$ = home team, $at$ = away team, $wa$ = win amount, $ba$ = bet amount.
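
The expected value formulation can be sketched in a few lines of Python. This is a minimal illustration, not the notebook's own implementation: the function name `expected_value`, the stake, the odds and the probability are all invented, and it assumes that the win amount is the profit on a winning bet (stake times decimal odds minus the stake).

```python
def expected_value(p_win: float, win_amount: float, bet_amount: float) -> float:
    """Expected value of a single bet: p * wa - (1 - p) * ba."""
    if not 0.0 <= p_win <= 1.0:
        raise ValueError("p_win must be a probability in [0, 1]")
    return p_win * win_amount - (1.0 - p_win) * bet_amount


# Hypothetical bet: 10 unit stake at decimal odds of 2.5.
# Assumption: the win amount is the profit, i.e. 10 * (2.5 - 1) = 15 units.
stake = 10.0
decimal_odds = 2.5
win_amount = stake * (decimal_odds - 1.0)

# Hypothetical model output: 45% chance of a home win.
ev_home = expected_value(p_win=0.45, win_amount=win_amount, bet_amount=stake)
print(round(ev_home, 2))  # 0.45 * 15 - 0.55 * 10 = 1.25
```

A positive value suggests the bet is favourable on average under the estimated probability; at a probability of 0.4 the same bet would break even.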

The metric of choice for both models is F-beta, weighted slightly towards recall, i.e. towards minimising false negatives. Decreasing false negatives (predicting no win when the result is actually a win) generally means increasing false positives (predicting a win when the result is actually no win), so this choice makes the models more risk accepting. How much risk to accept is a matter of preference.
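
To illustrate how a beta above 1 favours recall, the sketch below compares F1 with F-beta at beta = 1.2 on invented labels. It uses binary averaging for readability, whereas the notebook's own `fbeta` helper uses `average='weighted'`, so the numbers are illustrative only.

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score

# Toy labels (invented for illustration): 1 = win, 0 = lose/draw.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# A "risk accepting" classifier: catches every win but raises two false alarms.
y_pred = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

precision = precision_score(y_true, y_pred)  # 4 true wins out of 6 predicted wins
recall = recall_score(y_true, y_pred)        # 4 true wins out of 4 actual wins

f1 = fbeta_score(y_true, y_pred, beta=1.0)
f_recall = fbeta_score(y_true, y_pred, beta=1.2)

# Because recall exceeds precision here, weighting recall more lifts the score.
print(round(f1, 3), round(f_recall, 3))
```

With beta = 1.2 the classifier that misses no wins is rewarded slightly more than under F1, which is the trade-off described above.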


## Library Imports

In [None]:
import datetime
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
import joblib
import tensorflow as tf
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras import Model, Sequential
from math import ceil
from pyearth import Earth
from tqdm import trange, tqdm_notebook
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from scipy.stats import pearsonr
from scipy.stats import spearmanr
from scipy.stats import anderson
from scipy.stats import normaltest
from scipy.stats import shapiro
from scipy.stats import chi2_contingency
from scipy.stats import boxcox
from scipy.spatial.distance import cdist
from scipy.optimize import differential_evolution
from statsmodels.graphics.gofplots import qqplot
from statsmodels.stats.proportion import proportions_ztest
from sklearn.dummy import DummyClassifier
from sklearn.metrics import roc_auc_score
from sklearn.metrics import auc
from sklearn.metrics import balanced_accuracy_score
from sklearn.metrics import make_scorer
from sklearn.metrics import fbeta_score
from sklearn.metrics import f1_score
from sklearn.metrics import r2_score
from sklearn.metrics import explained_variance_score
from sklearn.metrics import roc_curve
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import cohen_kappa_score
from sklearn.metrics import precision_recall_fscore_support
from sklearn.kernel_approximation import RBFSampler
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import learning_curve
from sklearn.feature_selection import mutual_info_classif
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import RFE
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import PowerTransformer
from sklearn.preprocessing import QuantileTransformer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import ComplementNB
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.calibration import calibration_curve
from sklearn.calibration import CalibratedClassifierCV
from geopy.geocoders import Nominatim
from geopy.distance import geodesic
from xgboost import XGBClassifier
from xgboost import XGBRFClassifier
from xgboost import plot_importance
from mlxtend.evaluate import paired_ttest_5x2cv
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from hpsklearn import HyperoptEstimator
from hyperopt import atpe, tpe, fmin, hp, STATUS_OK, space_eval, Trials
from hyperopt.pyll.base import scope
from imblearn.under_sampling import TomekLinks
from imblearn.under_sampling import EditedNearestNeighbours
from imblearn.under_sampling import RepeatedEditedNearestNeighbours
from imblearn.under_sampling import OneSidedSelection
from imblearn.under_sampling import NeighbourhoodCleaningRule
from imblearn.under_sampling import InstanceHardnessThreshold
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTEENN
from imblearn.under_sampling import NearMiss
from imblearn.over_sampling import ADASYN
from imblearn.over_sampling import BorderlineSMOTE
from imblearn.over_sampling import KMeansSMOTE
from imblearn.over_sampling import SMOTE
from imblearn.over_sampling import SVMSMOTE
from imblearn.pipeline import Pipeline as IMBPipeline
from catboost import CatBoostClassifier, Pool, EFstrType
from lightgbm import LGBMClassifier
from keras.wrappers.scikit_learn import KerasClassifier
from keras.models import load_model

# show entire dataset
pd.set_option("display.max_rows", None, "display.max_columns", None)

## Functions

In [None]:
def similarity_imputation(data, impute_target, numeric_columns, missing_value='nan', n_neighbours=5, dec_places=0, regression=True):
    """
    Computes euclidean distance for feature similarity imputation

    if applying to multiple columns in the same dataframe the calculation will include previously imputed values

    params:
        data = dataframe that imputation is being applied to
        impute_target = feature column with missing data
        numeric_columns  = feature columns in dataframe that have numeric values (nan values show numeric columns dtypes as 'object')
        missing_value = value to impute
        n_neighbours = number of closest neighbours to consider for imputation
        dec_places = decimal places to match target feature column when imputing
        regression = true for regression, false for classification

    returns:
        new dataframe with applied imputation
        list of indices indicating rows with missing values that couldn't be used for distance calculation
    """
    orig_data = data.copy()
    features = data.copy()
    features.drop(labels=impute_target, axis=1, inplace=True)
    # ohe
    idx = features.dtypes == 'object'
    cat_cols = list(idx[idx].index.values)
    num_cols = numeric_columns
    for num_col in num_cols:
        if num_col in cat_cols:
            cat_cols.remove(num_col)
    features_ohe = pd.get_dummies(features, columns=cat_cols)
    # get index of missing target rows
    if missing_value == 'nan':
        mv = data[impute_target].isnull()
        row_idx = mv[mv].index.values  # get index of true values
    else:
        mv = data[impute_target] == missing_value
        row_idx = mv[mv].index.values  # get index of true values
    # data that has no missing target value.. original data, dropped target data & ohe data
    orig_data_with_target = orig_data.drop(row_idx, axis=0)
    data_with_target = features.drop(row_idx, axis=0)
    data_with_target_ohe = features_ohe.drop(row_idx, axis=0)
    # data that has missing target value.. dropped target data & ohe data
    # row_idx holds index labels (as used by .drop above), so select with .loc
    data_missing_target = features.loc[row_idx, :]
    data_missing_target_ohe = features_ohe.loc[row_idx, :]

    # main loop
    pred_data = data.copy()
    # index labels of rows skipped because they have 1 or more missing values in other columns
    nan_index = []
    for i in trange(len(data_missing_target),  desc='Target imputation row'):
        y = data_missing_target.iloc[i, :].copy()
        # y.drop(labels = 'shoe_size', inplace = True) # remove target variable
        y_ohe = data_missing_target_ohe.iloc[i, :].copy()

        # check for nan values in other columns
        nan_y_count = 0
        y_nan_cols = []
        for k in range(len(y)):
            if missing_value == 'nan':
                if pd.isnull(y.iloc[k]):
                    nan_y_count += 1  # count missing values
                    y_nan_cols.append(y.index[k])
            else:
                if y.iloc[k] == missing_value:
                    nan_y_count += 1  # count missing values
                    # temp store column name of nan value
                    y_nan_cols.append(y.index[k])
        if nan_y_count >= 1:
            # if i == 0:
            #    print('First imputation row has 1 or more missing values!\n')
            if data_missing_target.iloc[i, :].name not in nan_index:
                nan_index.append(data_missing_target.iloc[i, :].name)
            continue
        # elif nan_y_count <= 2:
        #    y.drop(lables = y_nan_cols, inplace = True)
        distances = []
        for j in range(len(data_with_target)):
            x = data_with_target.iloc[j, :].copy()
            # remove target variable
            #x.drop(labels = 'shoe_size', inplace = True)
            x_ohe = data_with_target_ohe.iloc[j, :].copy()
            # check for nan values in other columns
            nan_x_count = 0
            x_nan_cols = []
            for l in range(len(x)):
                if missing_value == 'nan':
                    if pd.isnull(x.iloc[l]):  # if missing value is nan but not recognised
                        nan_x_count += 1  # count missing values
                        x_nan_cols.append(x.index[l])
                else:
                    if x.iloc[l] == missing_value:
                        nan_x_count += 1  # count missing values
                        # temp store column name of nan value
                        x_nan_cols.append(x.index[l])
            if nan_x_count >= 1:  # if 1 or more values are missing, index observation and move to next
                if data_with_target.iloc[j, :].name not in nan_index:
                    nan_index.append(data_with_target.iloc[j, :].name)
                continue
            # elif nan_x_count <= 2: # if x contains 2 or less nans, drop columns
            #    x.drop(lables = x_nan_cols, inplace = True)
            # ohe ########## needs to work when a value is categorical ##########
            #x = pd.get_dummies(x)
            #y = pd.get_dummies(y)
            euc_dist = euclidean_dist(x_ohe, y_ohe)
            distances.append(euc_dist)

        data_feats = orig_data_with_target.copy()  # df of rows target values
        for ind in nan_index:
            if ind in data_feats.index:
                data_feats.drop(labels=ind, axis=0, inplace=True)
        data_feats['distance'] = distances
        data_feats.sort_values(by='distance', ascending=True, inplace=True)
        # check number of neighbours against available data
        if n_neighbours > len(data_feats):
            if len(data_feats) % 2 == 0:  # if len(data_feats) is even then make n_neighbours odd
                k_nn = data_feats.iloc[0: len(data_feats)-1, :]
            else:
                k_nn = data_feats.iloc[0: len(data_feats), :]
            print(
                f'The n_neighbours parameter value of {n_neighbours} is greater than available data to perform a distance calculation\ntherefore n_neighbours is automatically set to the nearest available value: {len(k_nn)}\n')
        else:
            k_nn = data_feats.iloc[0: n_neighbours, :]
        if regression:
            # majority is the mean of nearest neighbours
            majority = np.round(k_nn[impute_target].mean(), dec_places)
        else:  # classification
            # majority is the mode of the nearest neighbours (first mode if tied)
            majority = k_nn[impute_target].mode().values[0]
        # assign with .loc to avoid chained-assignment pitfalls
        orig_data.loc[row_idx[i], impute_target] = majority
    return orig_data, sorted(nan_index)
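
The core idea of `similarity_imputation` (find the closest complete rows by Euclidean distance, then average their target values) can be sketched in vectorised form. This is a simplified illustration rather than a replacement: it assumes purely numeric features with no other missing values, uses `scipy.spatial.distance.cdist` instead of the row-by-row loop, and the toy column names and values are invented.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

# Toy data (invented): 'goals' has one missing value to impute.
toy = pd.DataFrame({
    "shots": [10.0, 11.0, 3.0, 4.0, 10.5],
    "corners": [6.0, 7.0, 1.0, 2.0, 6.5],
    "goals": [2.0, 3.0, 0.0, 1.0, np.nan],
})

target = "goals"
feature_cols = ["shots", "corners"]
missing = toy[target].isnull()

known = toy.loc[~missing]
unknown = toy.loc[missing]

# Pairwise Euclidean distances: one row per observation with a missing target,
# one column per observation with a known target.
dists = cdist(unknown[feature_cols].to_numpy(), known[feature_cols].to_numpy())

n_neighbours = 2
imputed = toy.copy()
for row_pos, row_label in enumerate(unknown.index):
    nearest = np.argsort(dists[row_pos])[:n_neighbours]
    # Regression-style imputation: mean of the nearest neighbours' targets.
    imputed.loc[row_label, target] = known[target].iloc[nearest].mean()

print(imputed.loc[4, "goals"])  # the two closest rows have 2 and 3 goals, mean 2.5
```

For classification-style targets the mean would be swapped for the mode of the neighbours' values, as in the function above.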

In [None]:
def euclidean_dist(x_row, y_row):
    """ returns the euclidean distance between two equal-length numeric rows """
    # convert to float arrays so the arithmetic is positional and vectorised
    x_vals = np.asarray(x_row, dtype=float)
    y_vals = np.asarray(y_row, dtype=float)
    return np.sqrt(np.sum((x_vals - y_vals) ** 2))

In [None]:
def create_ohe_team_df(data):
    """
    creates a one hot encoded dataframe of teams
    params:
        data = main data (combined league data)

    creates dataframe or returns error of mismatch of home/away teams
    """
    if sorted(data['HomeTeam'].value_counts().index) == sorted(data['AwayTeam'].value_counts().index):
        teams_enc = pd.get_dummies(
            sorted(data['HomeTeam'].value_counts().index))
        teams_enc.to_csv('ohe_teams.csv', encoding='utf-8', index=False)
    else:
        return 'Home teams and away teams do not match, check columns match.'

In [None]:
def team_ohe(team):
    """ 
    one hot encodes team

    params:
        team = string name of team

    returns either array of ohe values for selected team or error for spelling
    """
    ohe_df = pd.read_csv('ohe_teams.csv')
    if team in ohe_df.columns:
        return ohe_df[team].values
    else:
        return "Cannot find team, please check spelling including capitals"
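
As a rough illustration of what `create_ohe_team_df` stores and what `team_ohe` reads back, the sketch below builds the encoding in memory for three teams. It skips the CSV round trip, and the three-team subset is an assumption made for brevity.

```python
import pandas as pd

# Hypothetical subset of teams; the notebook derives the full list from the data.
teams = sorted(["Leeds", "Arsenal", "Chelsea"])

# get_dummies on a list of unique, sorted names yields an identity-like table:
# row i has its single 1 in the column of the i-th team.
teams_enc = pd.get_dummies(teams).astype(int)

# Looking up a team's column gives its one hot vector.
chelsea_vector = teams_enc["Chelsea"].to_numpy()

print(list(teams_enc.columns))  # ['Arsenal', 'Chelsea', 'Leeds']
print(chelsea_vector)           # [0 1 0]
```

The vector length equals the number of teams, which is why the encoded home and away frames each need one column per team.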

In [None]:
def football_data_team_ohe(data):
    """
    one hot encodes football teams in data

    params:
        data = main football data

    returns new dataframe with encoded football teams
    """
    footy_data = data.copy()
    if sorted(data['HomeTeam'].value_counts().index) == sorted(data['AwayTeam'].value_counts().index):
        all_teams = sorted(footy_data['HomeTeam'].value_counts().index)
        # one hot encode home team
        all_home_teams = [f'home_{t}' for t in all_teams]
        home_teams = pd.DataFrame(0, index=range(
            0, len(footy_data)), columns=all_home_teams)

        for team in all_teams:
            if team in footy_data['HomeTeam'].values:
                mask = footy_data['HomeTeam'] == team
                mask_idx = mask[mask].index.values
                ohe_vals = team_ohe(team)
                for i in range(0, len(mask_idx)):
                    home_teams.loc[mask_idx[i]] = ohe_vals

        for col in home_teams:
            footy_data = footy_data.join(home_teams[col])
        # remove original home team data
        footy_data.drop(columns='HomeTeam', inplace=True)

        # one hot encode away team
        all_away_teams = [f'away_{t}' for t in all_teams]
        away_teams = pd.DataFrame(0, index=range(
            0, len(footy_data)), columns=all_away_teams)

        for team in all_teams:
            if team in footy_data['AwayTeam'].values:
                mask = footy_data['AwayTeam'] == team
                mask_idx = mask[mask].index.values
                ohe_vals = team_ohe(team)
                for i in range(0, len(mask_idx)):
                    away_teams.loc[mask_idx[i]] = ohe_vals

        for col in away_teams:
            footy_data = footy_data.join(away_teams[col])
        # remove original away team data
        footy_data.drop(columns='AwayTeam', inplace=True)
        return footy_data
    else:
        return 'Mismatch between home and away teams.'

In [None]:
def load_train_features():
    features = pd.read_csv('combined_leagues_train features.csv')
    return features

In [None]:
def load_transformed_train_data():
    ''' load train features with transformed continuous data only '''
    train_features = load_train_features()
    # transformed features
    box_cox = ['AHTSOT5PG', 'AATSOT5PG', 'AHTSOT5PHG',
               'AATSOT5PAG', 'AHTP5PG', 'AHTGS_SOT5PHG_ratio']
    quant = ['AHTGS5PG', 'AATGS5PG', 'AHTGC5PG', 'AATGC5PG', 'AHTGS5PHG', 'AATGS5PAG', 'AHTGC5PHG', 'AATGC5PAG', 'AHTGS_SOT5PG_ratio',
             'AATGS_SOT5PG_ratio', 'AATGS_SOT5PAG_ratio', 'AHTGD5PG', 'AATGD5PG', 'AHTGD5PHG', 'AATGD5PAG', 'HA_AHTGS5PG_diff',
             'HA_ATP5PG_diff', 'AHT_GS_P5PG_ratio', 'AAT_GS_P5PG_ratio', 'AwayTeamDist', 'AwayCapacityDiff_bin', 'AwayTeamDist_bin',
             'bxcx_AATGS5PG', 'bxcx_AHTGS5PG', 'bxcx_AHTGC5PG', 'bxcx_AHTSOT5PG', 'bxcx_AATSOT5PG', 'bxcx_AHT_GS_P5PG_ratio',
             'bxcx_AAT_GS_P5PG_ratio', 'bxcx_AATGC5PG']
    train_features.drop(columns=box_cox, inplace=True)
    train_features.drop(columns=quant, inplace=True)
    return train_features

In [None]:
def load_test_features():
    features = pd.read_csv('Combined_leagues_test_features.csv')
    return features

In [None]:
def load_home_targets():
    """ returns home win targets... train_target, test_target """
    train_target = pd.read_csv('Combined_leagues_home_train_target.csv')
    test_target = pd.read_csv('Combined_leagues_home_test_target.csv')
    return train_target, test_target

In [None]:
def load_away_targets():
    """ returns away win targets... train_target, test_target """
    train_target = pd.read_csv('Combined_leagues_away_train_target.csv')
    test_target = pd.read_csv('Combined_leagues_away_test_target.csv')
    return train_target, test_target

In [None]:
def load_home_win_train_df():
    df = load_train_features()
    target, _ = load_home_targets()
    df['HomeFTR'] = target
    return df

In [None]:
def load_home_train_df_transformed():
    df = load_transformed_train_data()
    target, _ = load_home_targets()
    df['HomeFTR'] = target
    return df

In [None]:
def load_away_win_train_df():
    df = load_train_features()
    target, _ = load_away_targets()
    df['AwayFTR'] = target
    return df

In [None]:
def load_away_train_df_transformed():
    df = load_transformed_train_data()
    target, _ = load_away_targets()
    df['AwayFTR'] = target
    return df

In [None]:
def save_train_features(train_features):
    """ saves train_features as csv file """
    train_features.to_csv(
        "combined_leagues_train features.csv", encoding='utf-8', index=False)

In [None]:
def save_test_features(test_features):
    """ saves test_features as csv file """
    test_features.to_csv("Combined_leagues_test_features.csv",
                         encoding='utf-8', index=False)

In [None]:
def cdf(sample, x):
    count = 0
    for val in sample:
        if val <= x:
            count += 1

    prob = count / len(sample)
    return prob
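
The `cdf` helper computes an empirical cumulative distribution value, i.e. the share of observations at or below `x`. An equivalent vectorised sketch, with the hypothetical name `empirical_cdf` and an invented sample, might look like this:

```python
import numpy as np


def empirical_cdf(sample, x):
    """Proportion of observations in `sample` that are <= x."""
    sample = np.asarray(sample)
    # The mean of a boolean mask is the fraction of True entries.
    return np.mean(sample <= x)


goals = [0, 1, 1, 2, 3]  # invented sample
print(empirical_cdf(goals, 1))  # 3 of the 5 values are <= 1, i.e. 0.6
```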

In [None]:
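def phi(contingency_table):
    """ phi coefficient: sqrt(chi2 / n) for a contingency table """
    chi2 = chi2_contingency(contingency_table)[0]
    # total observation count; np.asarray makes this work for DataFrames too
    n = np.asarray(contingency_table).sum()
    return np.sqrt(chi2 / n)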
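
One detail worth knowing when reading phi values for 2x2 tables: `chi2_contingency` applies Yates' continuity correction by default when there is one degree of freedom, so the result is slightly below the textbook phi coefficient. The sketch below shows both variants on an invented table.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Invented 2x2 contingency table.
table = np.array([[30, 10],
                  [10, 30]])

n = table.sum()

# Default call: Yates' continuity correction is applied for 2x2 tables.
chi2_yates = chi2_contingency(table)[0]
# Uncorrected chi-squared, which gives the textbook phi coefficient.
chi2_plain = chi2_contingency(table, correction=False)[0]

phi_yates = np.sqrt(chi2_yates / n)
phi_plain = np.sqrt(chi2_plain / n)

print(round(phi_yates, 3), round(phi_plain, 3))  # 0.475 0.5
```

Which variant is preferable depends on sample size and convention; the point is only that the two differ.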

In [None]:
def get_football_ground(team):
    """ 
    returns football ground and capacity for selected team
    football ground used in conjunction with geopy for geo location

    params:
        team = string of team

    returns tuple (ground name, capacity)
    """
    football_ground_dict = {
        'AFC Wimbledon': ('Wimbledon Plough Lane', 9300),
        'Accrington': ('Accrington Crown Ground', 5450),
        'Arsenal': ('London Emirates Stadium', 60260),
        'Aston Villa': ('Aston Villa Villa Park', 42749),
        'Barnsley': ('Barnsley Oakwell Stadium', 23287),
        'Birmingham': ('Birmingham St Andrews Trillion Trophy Stadium', 29409),
        'Blackburn': ('Blackburn Ewood Park', 31367),
        'Blackpool': ('Blackpool Bloomfield Road', 16616),
        'Bolton': ('Bolton University of Bolton Stadium', 28723),
        'Bournemouth': ('Bournemouth Vitality Stadium', 11329),
        'Bradford': ('Bradford Utilita Energy Stadium', 25136),
        'Brentford': ('Brentford Community Stadium', 17250),
        'Brighton': ('Brighton American Express Community Stadium', 30666),
        'Bristol City': ('Bristol Ashton Gate Stadium', 27000),
        'Bristol Rvs': ('Bristol Memorial Stadium', 12296),
        'Burnley': ('Burnley Turf Moor', 21944),
        'Burton': ('Burton Pirelli Stadium', 6912),
        'Bury': ('Bury Gigg Lane', 11840),
        'Cambridge': ('Cambridge Abbey Stadium', 8127),
        'Cardiff': ('Cardiff City Stadium', 33280),
        'Carlisle': ('Carlisle Brunton Park', 18202),
        'Charlton': ('Charlton The Valley', 27111),
        'Chelsea': ('Chelsea Stamford Bridge', 40834),
        'Cheltenham': ('Cheltenham Jonny-Rocks Stadium', 7066),
        'Chesterfield': ('chesterfield Proact Stadium', 10600),
        'Colchester': ('Colchester JobServe Community Stadium', 10105),
        'Coventry': ('Coventry The Ricoh Arena', 32609),
        'Crawley Town': ('Crawley Broadfield Stadium', 6134),
        'Crewe': ('Crewe Alexandra Stadium', 10153),
        'Crystal Palace': ('London Selhurst Park', 25486),
        'Dag and Red': ('Dagenham Victoria Road', 6078),
        'Derby': ('Derby Pride Park Stadium', 33597),
        'Doncaster': ('Doncaster Keepmoat Stadium', 15231),
        'Everton': ('Liverpool Goodison Park', 39414),
        'Exeter': ('Exeter St. James Park', 8541),
        'Fleetwood Town': ('Fleetwood Highbury Stadium', 5327),
        'Fulham': ('Fulham Craven Cottage', 19359),
        'Gillingham': ('Gillingham Priestfield Stadium', 11582),
        'Grimsby': ('Grimsby Blundell Park', 9031),
        'Hartlepool': ('Hartlepool Victoria Park', 7865),
        'Hereford': ('Hereford Edgar Street', 5213),
        'Huddersfield': ('Huddersfield Kirklees Stadium', 24500),
        'Hull': ('Hull KCom Stadium', 25586),
        'Ipswich': ('Ipswich Portman Road Stadium', 30311),
        'Leeds': ('Leeds Elland Road', 37792),
        'Leicester': ('Leicester King Power Stadium', 32312),
        'Leyton Orient': ('Leyton The Breyer Group Stadium', 9271),
        'Lincoln': ('Lincoln LNER Stadium', 10120),
        'Liverpool': ('Liverpool Anfield', 53394),
        'Luton': ('Luton Kenilworth Road', 10356),  # possible new stadium
        'Man City': ('Manchester Etihad Stadium', 55017),
        'Man United': ('Manchester Old Trafford', 74140),
        'Mansfield': ('Mansfield Field Mill', 9186),
        'Middlesbrough': ('Middlesbrough The Riverside Stadium', 34742),
        'Millwall': ('Millwall The Den', 20146),
        'Milton Keynes Dons': ('Milton Keynes Stadium MK', 30500),
        'Newcastle': ('Newcastle St. James Park', 52305),
        'Northampton': ('Northampton Sixfields Stadium', 7798),
        'Norwich': ('Norwich Carrow Road Stadium', 27244),
        "Nott'm Forest": ('Nottingham The City Ground', 30446),
        'Notts County': ('Nottingham Meadow Lane', 19841),
        'Oldham': ('Oldham Boundary Park', 13513),
        'Oxford': ('Oxford Kassam Stadium', 12400),
        'Peterboro': ('Peterborough Weston Homes Stadium', 15314),
        'Plymouth': ('Plymouth Home Park', 17904),
        'Port Vale': ('Port Vale Vale Park', 19052),
        'Portsmouth': ('Portsmouth Fratton Park', 20620),
        'Preston': ('Preston Deepdale Stadium', 23404),
        'QPR': ('Hammersmith Loftus Road', 18439),
        'Reading': ('Reading Madejski Stadium', 24161),
        'Rochdale': ('Rochdale Spotland Stadium', 15000),
        'Rotherham': ('Rotherham AESSEAL New York Stadium', 12021),
        'Rushden & D': ('Northampton Nene Park', 6441),
        'Scunthorpe': ('Scunthorpe Glanford Park', 9088),
        'Sheffield United': ('Sheffield Bramall Lane', 32050),
        'Sheffield Weds': ('Sheffield Hillsborough Stadium', 34854),
        'Shrewsbury': ('Shrewsbury New Meadow', 9875),
        'Southampton': ('Southampton St. Marys Stadium', 32505),
        'Southend': ('Southend-on-Sea Roots Hall Stadium', 12392),
        'Stevenage': ('Stevenage Lamex Stadium', 7800),
        'Stockport': ('Stockport Edgeley Park', 10841),
        'Stoke': ('Stoke Bet365 Stadium', 30089),
        'Sunderland': ('Sunderland Stadium of Light', 49000),
        'Swansea': ('Swansea Liberty Stadium', 21088),
        'Swindon': ('Swindon The County Ground', 15728),
        'Torquay': ('Torquay Plainmoor', 6500),
        'Tottenham': ('Tottenham White Hart Lane', 62303),
        'Tranmere': ('Tranmere Prenton Park', 16567),
        'Walsall': ('Walsall Bescot Stadium', 11300),
        'Watford': ('Watford Vicarage Road', 21577),
        'West Brom': ('West Bromwich The Hawthorns', 26688),
        'West Ham': ('West Ham The London Stadium', 60000),
        'Wigan': ('Wigan DW Stadium', 25133),
        'Wimbledon': ('Wimbledon Plough Lane', 9300),
        'Wolves': ('Wolverhampton Molineux Stadium', 31700),
        'Wrexham': ('Wrexham Glyndŵr University Racecourse Stadium', 10771),
        'Wycombe': ('Wycombe Adams Park', 9448),
        'Yeovil': ('Yeovil Huish Park', 9565)
    }
    return football_ground_dict.get(team)
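
The ground names are meant to be geocoded so that distances between grounds can be derived. As a dependency-free illustration of the distance step only, the sketch below uses the haversine great-circle formula rather than geopy's `geodesic`, and the coordinates are approximate values assumed for illustration rather than geocoder output.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * radius_km * asin(sqrt(a))


# Approximate coordinates, assumed for illustration only.
anfield = (53.4308, -2.9608)
old_trafford = (53.4631, -2.2913)

dist_km = haversine_km(*anfield, *old_trafford)
print(round(dist_km, 1))
```

Haversine treats the Earth as a sphere, so it differs slightly from an ellipsoidal geodesic distance; for a feature such as away travel distance the gap is small.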

In [None]:
def fbeta(y_true, y_pred):
    return fbeta_score(y_true, y_pred, beta=1.2, average='weighted')

In [None]:
def r2(y_true, y_pred):
    return r2_score(y_true, y_pred)

In [None]:
def explained_var(y_true, y_pred):
    return explained_variance_score(y_true, y_pred)

In [None]:
def evaluate_model(features, target, model, k_splits=5):
    cv = RepeatedStratifiedKFold(
        n_splits=k_splits, n_repeats=2, random_state=1)
    # evaluation scoring metric
    metric = make_scorer(fbeta)
    scores = cross_val_score(model, features, target,
                             scoring=metric, cv=cv, n_jobs=-1)
    return scores
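
A usage sketch of the same evaluation scheme on synthetic data is shown below. The data set, the two models and the `zero_division=0` argument are assumptions for illustration; the cross-validation settings and the F-beta scorer mirror `evaluate_model` and `fbeta`.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic, mildly imbalanced binary problem (invented stand-in for match data).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_clusters_per_class=1,
    class_sep=1.5, weights=[0.6, 0.4], random_state=1,
)

# Same scheme as evaluate_model: 5 stratified folds, repeated twice.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=1)
metric = make_scorer(fbeta_score, beta=1.2, average="weighted", zero_division=0)

baseline = cross_val_score(
    DummyClassifier(strategy="most_frequent"), X, y, scoring=metric, cv=cv)
logreg = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, scoring=metric, cv=cv)

print(len(logreg))  # 5 splits x 2 repeats = 10 scores
print(round(baseline.mean(), 3), round(logreg.mean(), 3))
```

Comparing against a dummy baseline gives the cross-validated scores a reference point, since a weighted F-beta on imbalanced classes is not zero even for a model that always predicts the majority class.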

In [None]:
def cramers_corrected_stat(contingency_table):
    """ calculate Cramers V statistic for categorical-categorical association.
        uses the bias correction from Wicher Bergsma,
        Journal of the Korean Statistical Society 42 (2013): 323-328
    """
    chi2 = chi2_contingency(contingency_table)[0]
    n = contingency_table.sum()
    phi2 = chi2/n
    r, k = contingency_table.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    return np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))
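
A self-contained sketch of the same bias-corrected statistic applied to a cross-tabulation is given below. The function name `cramers_v_corrected`, the column names and the data are invented for illustration; the table is converted to a NumPy array first so that `.sum()` returns the grand total.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def cramers_v_corrected(table):
    """Bias-corrected Cramer's V for an r x k contingency table."""
    table = np.asarray(table)
    chi2 = chi2_contingency(table)[0]
    n = table.sum()
    r, k = table.shape
    phi2corr = max(0.0, chi2 / n - (k - 1) * (r - 1) / (n - 1))
    rcorr = r - (r - 1) ** 2 / (n - 1)
    kcorr = k - (k - 1) ** 2 / (n - 1)
    return np.sqrt(phi2corr / min(kcorr - 1, rcorr - 1))


# Invented data: a three-level category against a binary outcome.
toy = pd.DataFrame({
    "division": ["A"] * 40 + ["B"] * 40 + ["C"] * 40,
    "home_win": [1] * 40 + [0] * 40 + [1] * 20 + [0] * 20,
})
associated = pd.crosstab(toy["division"], toy["home_win"]).to_numpy()

# A table with identical rows carries no association at all.
independent = np.array([[20, 20], [20, 20], [20, 20]])

print(round(cramers_v_corrected(associated), 3))
print(cramers_v_corrected(independent))  # 0.0
```

Values close to 0 indicate little association and values close to 1 a strong one, which is how the statistic can be used to screen categorical features.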

In [None]:
def load_home_train_features_with_drop():
    """ loads home training features with uncorrelated/unassociated features dropped and without continuous transformations """
    home_df = load_home_win_train_df()
    non_sig = ['HF', 'AF', 'AY', 'HR', 'AR', 'year', 'AHTGS_SOT5PG_ratio', 'AHTGS_SOT5PHG_ratio', 'AATGS5PG_LOWoutlier',
               'AHTSOT5PG_LOWoutlier', 'AATSOT5PG_LOWoutlier', 'AATSOT5PAG_LOWoutlier', 'AHTGS_SOT5PG_ratio_UPoutlier',
               'AATGS_SOT5PG_ratio_UPoutlier', 'AHTGS_SOT5PHG_ratio_UPoutlier', 'AATGS_SOT5PAG_ratio_UPoutlier',
               'AHTGD5PG_LOWoutlier', 'AATGD5PG_LOWoutlier', 'AwayTeamDist', 'AwayTeamDist_bin', 'Local_Derby',
               'AATGS5PG_upqrt_AHTGC5PG_lowqrt']
    #outlier_feats = [col for col in home_df.columns if 'outlier' in col]
    # outlier_feat_keep = ['AHTGC5PG_LOWoutlier','AATGC5PG_UPoutlier','AHTGS5PHG_UPoutlier','AATGS5PAG_UPoutlier',
    #                     'AATSOT5PG_LOWoutlier','AHTGD5PG_UPoutlier','AATGD5PG_UPoutlier','AATGD5PAG_LOWoutlier']
    #drop_outlier_feats = [col for col in outlier_feats if col not in outlier_feat_keep  ]
    others_to_drop = ['Div', 'FTHG', 'FTAG', 'HS', 'AS', 'HST', 'AST', 'HC', 'AC', 'HY',
                      'AwayCapacityDiff', 'HomeFTR']  # ,'AwayCapacityDiff_bin','Dist>=100','cluster_0','cluster_1',
    #                'cluster_2','bxcx_AHTGS5PG','bxcx_AHTGC5PG','bxcx_AATGC5PG','HA_AHTGS5PG_diff_lowqrt','AATGC5PG_upqrt',
    #                  'AATGC5PG_lowqrt','HomeFTR']
    #drop_others = [col for col in others_to_drop if col not in non_sig or col not in drop_outlier_feats]
    # all features to drop
    transformed = [col for col in home_df.columns if 'TRANSFORM' in col]
    combined_feats = non_sig + others_to_drop + transformed
    # drop features
    drop_df = home_df.drop(columns=combined_feats)

    return drop_df

In [None]:
def load_home_train_features_with_drop_transformed():
    """ loads home training features with uncorrelated/unassociated features dropped and with continuous transformations """
    home_df = load_home_train_df_transformed()
    non_sig = ['HF', 'AF', 'AY', 'HR', 'AR', 'year', 'AHTGS_SOT5PG_ratio_quantileTRANSFORM', 'AHTGS_SOT5PHG_ratio_bxcx_pwrTRANSFORM',
               'AATGS5PG_LOWoutlier', 'AHTSOT5PG_LOWoutlier', 'AATSOT5PG_LOWoutlier', 'AATSOT5PAG_LOWoutlier',
               'AHTGS_SOT5PG_ratio_UPoutlier', 'AATGS_SOT5PG_ratio_UPoutlier', 'AHTGS_SOT5PHG_ratio_UPoutlier',
               'AATGS_SOT5PAG_ratio_UPoutlier', 'AHTGD5PG_LOWoutlier', 'AATGD5PG_LOWoutlier', 'AwayTeamDist_quantileTRANSFORM',
               'AwayTeamDist_bin_quantileTRANSFORM', 'Local_Derby', 'AATGS5PG_upqrt_AHTGC5PG_lowqrt']
    #outlier_feats = [col for col in home_df.columns if 'outlier' in col]
    # outlier_feat_keep = ['AHTGC5PG_LOWoutlier','AATGC5PG_UPoutlier','AHTGS5PHG_UPoutlier','AATGS5PAG_UPoutlier',
    #                     'AATSOT5PG_LOWoutlier','AHTGD5PG_UPoutlier','AATGD5PG_UPoutlier','AATGD5PAG_LOWoutlier']
    #drop_outlier_feats = [col for col in outlier_feats if col not in outlier_feat_keep  ]
    others_to_drop = ['Div', 'FTHG', 'FTAG', 'HS', 'AS', 'HST', 'AST', 'HC', 'AC', 'HY',
                      'AwayCapacityDiff', 'HomeFTR']  # ,'AwayCapacityDiff_bin','Dist>=100','cluster_0','cluster_1',
    #                'cluster_2','bxcx_AHTGS5PG','bxcx_AHTGC5PG','bxcx_AATGC5PG','HA_AHTGS5PG_diff_lowqrt','AATGC5PG_upqrt',
    #                  'AATGC5PG_lowqrt','HomeFTR']
    #drop_others = [col for col in others_to_drop if col not in non_sig or col not in drop_outlier_feats]
    # all features to drop
    combined_feats = non_sig + others_to_drop
    # drop features
    drop_df = home_df.drop(columns=combined_feats)

    return drop_df

In [None]:
def load_away_train_features_with_drop():
    """ loads away training features with uncorrelated/unassociated features dropped and without continuous transformations"""
    away_df = load_away_win_train_df()
    non_sig = ['HF', 'AF', 'HC', 'AY', 'HR', 'AR', 'AHTGS_SOT5PG_ratio', 'AHTGS_SOT5PHG_ratio', 'AATGS5PG_LOWoutlier',
               'AHTGC5PG_LOWoutlier', 'AATGC5PG_UPoutlier', 'AATGC5PAG_UPoutlier', 'AHTSOT5PG_LOWoutlier',
               'AATSOT5PG_LOWoutlier', 'AATSOT5PAG_LOWoutlier', 'AHTGS_SOT5PG_ratio_UPoutlier', 'AATGS_SOT5PG_ratio_UPoutlier',
               'AHTGS_SOT5PHG_ratio_UPoutlier', 'AATGS_SOT5PAG_ratio_UPoutlier', 'AHTGD5PG_LOWoutlier', 'AATGD5PG_LOWoutlier',
               'AATGD5PAG_LOWoutlier', 'AwayTeamDist', 'AwayTeamDist_bin', 'Local_Derby', 'Dist>=100',
               'AHTGS5PG_upqrt_AATGC5PG_lowqrt', 'AATGS5PG_upqrt_AHTGC5PG_lowqrt']
    #outlier_feats = [col for col in away_df.columns if 'outlier' in col]
    # outlier_feat_keep = ['AATGS5PG_UPoutlier','AHTGC5PG_UPoutlier','AATGS5PAG_UPoutlier','AHTGC5PHG_UPoutlier','AATGC5PAG_UPoutlier',
    #                    'AATSOT5PG_UPoutlier','AHTSOT5PHG_UPoutlier','AATCOT5PAG_UPoutlier','AHTGD5PHG_LOWoutlier']
    #drop_outlier_feats = [col for col in outlier_feats if col not in outlier_feat_keep  ]
    others_to_drop = ['Div', 'FTHG', 'FTAG', 'HS', 'AS', 'HST', 'AST', 'AC', 'HY',
                      'AwayCapacityDiff', 'AwayFTR']  # ,'AwayCapacityDiff_bin','Dist>=100','cluster_2',
    #                 'bxcx_AHTGC5PG','bxcx_AATGC5PG','bxcx_AHTSOT5PG','AHTGC5PG_upqrt','AHTGC5PG_lowqrt','AATGC5PG_upqrt',
    #                  'AATGC5PG_lowqrt','AwayFTR']
    #drop_others = [col for col in others_to_drop if col not in non_sig or col not in drop_outlier_feats]
    # all features to drop
    transformed = [col for col in away_df.columns if 'TRANSFORM' in col]
    combined_feats = non_sig + others_to_drop + transformed
    # drop features
    drop_df = away_df.drop(columns=combined_feats)

    return drop_df

In [None]:
def load_away_train_features_with_drop_transformed():
    """ loads away training features with uncorrelated/unassociated features dropped and with continuous transformations"""
    away_df = load_away_train_df_transformed()
    non_sig = ['HF', 'AF', 'HC', 'AY', 'HR', 'AR', 'AHTGS_SOT5PG_ratio_quantileTRANSFORM', 'AHTGS_SOT5PHG_ratio_bxcx_pwrTRANSFORM',
               'AATGS5PG_LOWoutlier', 'AHTGC5PG_LOWoutlier', 'AATGC5PG_UPoutlier', 'AATGC5PAG_UPoutlier', 'AHTSOT5PG_LOWoutlier',
               'AATSOT5PG_LOWoutlier', 'AATSOT5PAG_LOWoutlier', 'AHTGS_SOT5PG_ratio_UPoutlier', 'AATGS_SOT5PG_ratio_UPoutlier',
               'AHTGS_SOT5PHG_ratio_UPoutlier', 'AATGS_SOT5PAG_ratio_UPoutlier', 'AHTGD5PG_LOWoutlier', 'AATGD5PG_LOWoutlier',
               'AATGD5PAG_LOWoutlier', 'AwayTeamDist_quantileTRANSFORM', 'AwayTeamDist_bin_quantileTRANSFORM', 'Local_Derby',
               'Dist>=100', 'AHTGS5PG_upqrt_AATGC5PG_lowqrt', 'AATGS5PG_upqrt_AHTGC5PG_lowqrt']
    #outlier_feats = [col for col in away_df.columns if 'outlier' in col]
    # outlier_feat_keep = ['AATGS5PG_UPoutlier','AHTGC5PG_UPoutlier','AATGS5PAG_UPoutlier','AHTGC5PHG_UPoutlier','AATGC5PAG_UPoutlier',
    #                    'AATSOT5PG_UPoutlier','AHTSOT5PHG_UPoutlier','AATCOT5PAG_UPoutlier','AHTGD5PHG_LOWoutlier']
    #drop_outlier_feats = [col for col in outlier_feats if col not in outlier_feat_keep  ]
    others_to_drop = ['Div', 'FTHG', 'FTAG', 'HS', 'AS', 'HST', 'AST', 'AC', 'HY',
                      'AwayCapacityDiff', 'AwayFTR']  # ,'AwayCapacityDiff_bin','Dist>=100','cluster_2',
    #                 'bxcx_AHTGC5PG','bxcx_AATGC5PG','bxcx_AHTSOT5PG','AHTGC5PG_upqrt','AHTGC5PG_lowqrt','AATGC5PG_upqrt',
    #                  'AATGC5PG_lowqrt','AwayFTR']
    #drop_others = [col for col in others_to_drop if col not in non_sig or col not in drop_outlier_feats]
    # all features to drop
    #transformed = [col for col in away_df.columns if 'TRANSFORM' in col]
    combined_feats = non_sig + others_to_drop
    # drop features
    drop_df = away_df.drop(columns=combined_feats)

    return drop_df

In [None]:
def sequential_forward_feature_selection(model, dataframe, sampling=None, combined_sampling=None):
    """ Classification """

    feats, targ = load_data(dataframe)
    feats, targ = shuffle(feats, targ, random_state=42)
    remaining_features = list(feats.columns)
    feature_list = []
    best_subfeat_scores = None
    best_score = None
    for i in trange(1, len(feats.columns) + 1, desc='Feature round...'):
        #print(f'\nremaining features: {remaining_features}\n')
        top_round_feat = None
        top_round_scores = None
        for rf in remaining_features:
            # select features to score & one hot encode
            if len(feature_list) == 0:
                sub_feats = feats[rf]
                if sub_feats.dtype == 'O':
                    sub_feats = pd.get_dummies(sub_feats)
                else:
                    sub_feats = np.array(sub_feats).reshape(-1, 1)
            else:
                sub_feats = pd.concat([feats[feature_list], feats[rf]], axis=1)
                if 'O' in sub_feats.dtypes.tolist():
                    sub_feats = pd.get_dummies(sub_feats)

            # initiate pipeline - scale, sampling - if needed, model
            if sampling is not None and combined_sampling is None:
                pipe = Pipeline(
                    steps=[
                        ('scaler', StandardScaler(with_mean=False)),
                        ('sampling', sampling),
                        ('model', model)
                    ]
                )
            elif sampling is not None and combined_sampling is not None:
                pipe = Pipeline(
                    steps=[
                        ('scaler', StandardScaler(with_mean=False)),
                        ('sampling', sampling),
                        ('combo sampling', combined_sampling),
                        ('model', model)
                    ]
                )
            else:
                pipe = Pipeline(
                    steps=[
                        ('scaler', StandardScaler(with_mean=False)),
                        ('model', model)
                    ]
                )
            # evaluate sub_feats with model
            scores = evaluate_model(sub_feats, targ, pipe)
            current_score = np.mean(scores)
            # assert not np.isnan(current_score), 'Subset feature score is nan; check the terminal or increase cv verbosity to find the problem'
            # keep best score, feature & scores
            if best_score is not None:
                if current_score > best_score:
                    best_score = current_score
                    top_round_feat = rf
                    top_round_scores = scores
                else:
                    continue
            else:
                best_score = current_score
                top_round_feat = rf
                top_round_scores = scores

            #print(f'top round feat: {top_round_feat}')
            #print(f'best score: {best_score}')

        # update best scores
        if top_round_scores is not None:
            best_subfeat_scores = top_round_scores
        # update best features
        if top_round_feat is not None:
            feature_list.append(top_round_feat)
            # remove best round feature from next round
            remaining_features.remove(top_round_feat)
        else:
            # if no feature improved the model we break and finish
            print('Optimal features reached......\n\n')
            break

    # print model, best features, score and standard deviation
    print(f'{model}\n\nBest Features:\n{feature_list}\n\nScore: {np.mean(best_subfeat_scores)}        Std: {np.std(best_subfeat_scores)}')
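
The greedy loop above can be tried on synthetic data. The following is a minimal sketch, not the project's own routine: `forward_select` is a hypothetical helper, the F1 scorer and 5-fold cross-validation are assumptions (the notebook scores through `evaluate_model`), and it leaves out the scaling, sampling and one hot encoding steps.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def forward_select(model, X, y, cv=5):
    """Greedy forward selection over the column indices of X (hypothetical helper)."""
    remaining = list(range(X.shape[1]))
    selected, best_score = [], -np.inf
    while remaining:
        # score each remaining feature when added to the current subset
        round_scores = {
            j: cross_val_score(model, X[:, selected + [j]], y,
                               cv=cv, scoring='f1').mean()
            for j in remaining
        }
        top_feat = max(round_scores, key=round_scores.get)
        # stop as soon as no candidate beats the best score found so far
        if round_scores[top_feat] <= best_score:
            break
        best_score = round_scores[top_feat]
        selected.append(top_feat)
        remaining.remove(top_feat)
    return selected, best_score


# synthetic data: 6 features, only 2 of them informative (assumption for the demo)
X, y = make_classification(n_samples=300, n_features=6, n_informative=2,
                           n_redundant=0, random_state=42)
selected, score = forward_select(LogisticRegression(max_iter=1000), X, y)
print(selected, round(score, 3))
```

As in the function above, the search ends at the first round where no single feature improves the score, so it can miss features that only help in combination.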

In [None]:
def get_ml_models():
    """ models for testing """
    models, names = list(), list()

    models.append(LogisticRegression(solver='liblinear',
                                     class_weight='balanced', random_state=42))
    names.append('LR')

    #models.append(NuSVC(random_state = 42))
    # names.append('NuSVC')

    models.append(KNeighborsClassifier(n_neighbors=5, n_jobs=-1))
    names.append('KNN')

    models.append(DecisionTreeClassifier(class_weight='balanced',
                                         max_features='sqrt', random_state=42))
    names.append('DT')

    models.append(RandomForestClassifier(n_estimators=200,
                                         class_weight='balanced', max_features='sqrt', n_jobs=-1, random_state=42))
    names.append('RF')

    models.append(GradientBoostingClassifier(
        n_estimators=200, max_features='sqrt', random_state=42))
    names.append('GB')

    models.append(AdaBoostClassifier(n_estimators=100, random_state=42))
    names.append('AB')

    models.append(GaussianNB())
    names.append('GNB')

    models.append(ComplementNB())
    names.append('CNB')

    models.append(XGBClassifier(n_estimators=200, max_depth=3,
                                use_label_encoder=False, random_state=42))
    names.append('XGB')

    models.append(XGBRFClassifier(
        n_estimators=200, max_depth=3, random_state=42))
    names.append('XGBRF')

    return models, names

In [None]:
def get_us_models():
    """ undersamplng models """
    models, names = list(), list()

    models.append(TomekLinks())
    names.append('TL')

    models.append(EditedNearestNeighbours())
    names.append('ENN')

    models.append(RepeatedEditedNearestNeighbours())
    names.append('RENN')

    models.append(OneSidedSelection())
    names.append('OSS')

    models.append(NeighbourhoodCleaningRule())
    names.append('NCR')

    models.append(InstanceHardnessThreshold())
    names.append('IHT')

    models.append(SMOTEENN(enn=EditedNearestNeighbours(
        sampling_strategy='majority')))
    names.append('S(ENN)')

    models.append(NearMiss(version=1, n_neighbors=3))
    names.append('NM1')

    models.append(NearMiss(version=2, n_neighbors=3))
    names.append('NM2')

    models.append(NearMiss(version=3, n_neighbors_ver3=3))
    names.append('NM3')

    return models, names

In [None]:
def home_gb_setup():
    gb_model = GradientBoostingClassifier(random_state=42)
    home_train_features = load_home_train_features_with_drop()
    gb_feats = home_train_features[['AHTGC5PHG', 'AHTSOT5PG', 'AATSOT5PG', 'AHTSOT5PHG', 'AATGS_SOT5PAG_ratio', 'AATGD5PAG',
                                    'season_month', 'HA_ATP5PG_diff', 'AwayCapacityDiff_bin', 'HT_PrevSeasonPos_inv',
                                    'AT_PrevSeasonPos_inv', 'bxcx_AHTSOT5PG', 'bxcx_AATGC5PG']]
    gb_svms = SVMSMOTE(random_state=1)
    gb_name = 'GB'
    return gb_model, gb_feats, gb_svms, gb_name

In [None]:
def home_gnb_setup():
    gnb_model = GaussianNB()
    home_train_features = load_home_train_features_with_drop_transformed()
    gnb_feats = home_train_features[['AwayCapacityDiff_bin_quantileTRANSFORM', 'HT_PrevSeasonPos_inv',
                                     'HA_ATP5PG_diff_quantileTRANSFORM', 'AATSOT5PAG_bxcx_pwrTRANSFORM', 'season_month_sin',
                                     'AHTGC5PG_UPoutlier', 'AT_PrevSeasonPos_inv', 'season_month']]
    gnb_pca = PCA(n_components=7, whiten=True)
    gnb_name = 'GNB'
    return gnb_model, gnb_feats, gnb_pca, gnb_name

In [None]:
def away_ab_setup():
    ab_model = AdaBoostClassifier(n_estimators=100, random_state=42)
    away_train_features = load_away_train_features_with_drop()
    ab_feats = away_train_features[['AHTGC5PHG', 'AATGC5PAG', 'AATSOT5PG', 'AHTP5PHG', 'month', 'year', 'AATGS_SOT5PG_ratio',
                                    'AATGS_SOT5PAG_ratio', 'AHTGD5PG', 'AATGD5PG', 'HA_AHTGS5PG_diff', 'AHT_GS_P5PG_ratio',
                                    'AAT_GS_P5PG_ratio', 'AwayCapacityDiff_bin', 'HT_PrevSeasonPos_inv', 'AT_PrevSeasonPos_inv',
                                    'bxcx_AHTGS5PG', 'bxcx_AHTGC5PG', 'bxcx_AHTSOT5PG', 'bxcx_AATSOT5PG']]
    ab_svms = SVMSMOTE(random_state=1)
    ab_name = 'AB'
    return ab_model, ab_feats, ab_svms, ab_name

In [None]:
def away_lgbm_setup():
    lgbm_model = LGBMClassifier(n_estimators=200,
                                learning_rate=0.1,
                                objective='binary',
                                max_depth=3,
                                n_jobs=-1,
                                random_state=42)
    away_train_features = load_away_train_features_with_drop()
    lgbm_feats = away_train_features[['AHTGS5PG', 'AHTGC5PG', 'AHTSOT5PHG', 'AATSOT5PAG', 'AHTP5PHG', 'AHTGD5PG',
                                      'HA_AHTGS5PG_diff', 'HA_ATP5PG_diff', 'AHT_GS_P5PG_ratio', 'AwayCapacityDiff_bin',
                                      'HT_PrevSeasonPos_inv', 'AT_PrevSeasonPos_inv']]
    lgbm_name = 'LGBM'
    return lgbm_model, lgbm_feats, lgbm_name

In [None]:
def get_os_models():
    """ oversampling models """
    models, names = list(), list()

    models.append(SMOTE())
    names.append('S')

    models.append(BorderlineSMOTE())
    names.append('BS')

    models.append(SVMSMOTE())
    names.append('SVMS')

    # models.append(ADASYN(sampling_strategy='minority'))
    # names.append('AS')

    # models.append(KMeansSMOTE())
    # names.append('KMS')

    return models, names

In [None]:
# weighted average functions

def ensemble_predictions(models, weights, test_features):
    """
    make ensemble predictions, returns prediction
    parameters:
        models = list of models/pipes to include in ensemble
        weights = array of weights to use for predictions
        test_features = list of test features for each model included.. must be in same order

    returns:
        predicted classes, predicted probabilities for success, i.e label 1
    """
    # model predictions: probabilities for class 1, one row per model
    yhat_probs = np.array([model.predict_proba(test_features)[:, 1]
                           for model in models])
    # weighted sum across models.. probabilities of the weighted models
    # summed products of probabilities & weights
    weighted_probs = np.tensordot(yhat_probs, weights, axes=((0), (0)))
    # get weighted predicted classes
    # round probabilities to classes.. 0 or 1
    yhat_classes = np.round(weighted_probs)
    # turn float values into integers
    yhat_classes = yhat_classes.astype(int)
    return yhat_classes, weighted_probs


def evaluate_ensemble(models, weights, test_features, test_target):
    """returns predicted classes, success probabilities and metric evaluation for models"""
    # predictions
    yhat, probs = ensemble_predictions(models, weights, test_features)
    # calculate fbeta score
    return yhat, probs, fbeta(test_target, yhat)


def normalise(weights):
    """normalises weight vector to have unit norm, returns normalised weight vector"""
    # l1 vector norm
    result = np.linalg.norm(weights, 1)
    # check for vector of all 0's
    if result == 0.0:
        return weights
    return weights / result


def loss_function(weights, models, test_features, test_target):
    """loss function for optimisation, designed to be minimised"""
    # normalise weights
    norm_weights = normalise(weights)
    # calculate error
    _, _, score = evaluate_ensemble(
        models, norm_weights, test_features, test_target)
    return 1.0 - score
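
To make the weighting step concrete, here is a small worked sketch of what `normalise` and the `np.tensordot` call do. The probabilities and raw weights are made-up numbers for three hypothetical models and four matches, not project results.

```python
import numpy as np

# class 1 probabilities from three hypothetical models (rows) for four matches (columns)
yhat_probs = np.array([
    [0.9, 0.2, 0.6, 0.4],
    [0.8, 0.1, 0.4, 0.6],
    [0.7, 0.3, 0.7, 0.3],
])
raw_weights = np.array([2.0, 1.0, 1.0])

# L1 normalisation, as in normalise(): the weights now sum to 1
weights = raw_weights / np.linalg.norm(raw_weights, 1)

# contract the model axis: one weighted probability per match
weighted_probs = np.tensordot(yhat_probs, weights, axes=((0), (0)))

# round to classes 0 or 1
yhat_classes = np.round(weighted_probs).astype(int)

print(weights)
print(weighted_probs)
print(yhat_classes)
```

Because the weights are normalised to sum to 1, the weighted sum stays inside the range of the individual model probabilities and can still be read as a probability.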

In [None]:
def load_cluster_features():
    return pd.read_csv('cluster_features.csv')

In [None]:
def load_cluster1_target():
    return pd.read_csv('cluster_1_target.csv')

In [None]:
def load_cluster3_target():
    return pd.read_csv('cluster_3_target.csv')

In [None]:
def load_cluster_test_feats():
    return pd.read_csv('cluster_test_features.csv')

In [None]:
def load_cluster_models():
    """ models for testing """
    models, names = list(), list()

    models.append(KNeighborsClassifier(n_neighbors=5, n_jobs=-1))
    names.append('KNN')

    models.append(DecisionTreeClassifier(class_weight='balanced',
                                         max_features='sqrt', random_state=42))
    names.append('DT')

    models.append(RandomForestClassifier(n_estimators=200,
                                         class_weight='balanced', max_features='sqrt', n_jobs=-1, random_state=42))
    names.append('RF')

    models.append(GradientBoostingClassifier(n_estimators=200,
                                             learning_rate=0.1, max_features='sqrt', random_state=42))
    names.append('GB')

    models.append(AdaBoostClassifier(n_estimators=100, random_state=42))
    names.append('AB')

    models.append(XGBClassifier(n_estimators=200, learning_rate=0.1,
                                max_depth=3, use_label_encoder=False, random_state=42))
    names.append('XGB')

    models.append(CatBoostClassifier(n_estimators=500, learning_rate=0.1,
                                     max_depth=6, auto_class_weights='Balanced', random_seed=42))
    names.append('CBC')

    models.append(LGBMClassifier(objective='binary', n_estimators=200,
                                 learning_rate=0.1, max_depth=3, is_unbalance=True, random_state=42))
    names.append('LGBM')

    return models, names

In [None]:
class OHETeamTransformer():
    """ 
    custom transformer that one hot encodes football teams
    """

    def __init__(self):
        pass

    def team_ohe(self, team):
        """ 
        one hot encodes team

        params:
            team = string name of team

        returns either array of ohe values for selected team or error for spelling
        """
        ohe_df = pd.read_csv('ohe_teams.csv')
        if team in ohe_df.columns:
            return ohe_df[team].values
        else:
            return "Cannot find team, please check spelling including capitals"

    def football_data_team_ohe(self, data):
        """
        one hot encodes football teams in data

        params:
            data = main football data

        returns new dataframe with encoded football teams
        """
        footy_data = data.copy()
        if sorted(data['HomeTeam'].value_counts().index) == sorted(data['AwayTeam'].value_counts().index):
            all_teams = sorted(footy_data['HomeTeam'].value_counts().index)
            # one hot encode home team
            all_home_teams = [f'home_{t}' for t in all_teams]
            home_teams = pd.DataFrame(0, index=range(
                0, len(footy_data)), columns=all_home_teams)

            for team in all_teams:
                if team in footy_data['HomeTeam'].values:
                    mask = footy_data['HomeTeam'] == team
                    mask_idx = mask[mask].index.values
                    ohe_vals = self.team_ohe(team)
                    for i in range(0, len(mask_idx)):
                        home_teams.loc[mask_idx[i]] = ohe_vals

            for col in home_teams:
                footy_data = footy_data.join(home_teams[col])
            # remove original home team data
            footy_data.drop(columns='HomeTeam', inplace=True)

            # one hot encode away team
            all_away_teams = [f'away_{t}' for t in all_teams]
            away_teams = pd.DataFrame(0, index=range(
                0, len(footy_data)), columns=all_away_teams)

            for team in all_teams:
                if team in footy_data['AwayTeam'].values:
                    mask = footy_data['AwayTeam'] == team
                    mask_idx = mask[mask].index.values
                    ohe_vals = self.team_ohe(team)
                    for i in range(0, len(mask_idx)):
                        away_teams.loc[mask_idx[i]] = ohe_vals

            for col in away_teams:
                footy_data = footy_data.join(away_teams[col])
            # remove original away team data
            footy_data.drop(columns='AwayTeam', inplace=True)
            return footy_data
        else:
            return 'Mismatch between home and away teams.'

    def transform(self, data, **transform_params):
        ohe_df = self.football_data_team_ohe(data)
        return ohe_df

    def fit(self, X, y=None, **fit_params):
        return self
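
For comparison, the same `home_` / `away_` column layout can be produced with `pd.get_dummies`. This is a swapped-in technique shown as a sketch, not the lookup against `ohe_teams.csv` used by the class above, and the three-row frame and team names are made up for illustration.

```python
import pandas as pd

# toy match data (hypothetical rows)
matches = pd.DataFrame({
    'HomeTeam': ['Arsenal', 'Leeds', 'Arsenal'],
    'AwayTeam': ['Leeds', 'Arsenal', 'Leeds'],
    'FTHG': [2, 1, 0],
})

# one column per team, prefixed by whether the team played at home or away;
# the original HomeTeam / AwayTeam columns are dropped
encoded = pd.get_dummies(matches, columns=['HomeTeam', 'AwayTeam'],
                         prefix=['home', 'away'], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

One difference to keep in mind: `get_dummies` only creates columns for teams present in the data it is given, so train and test frames can end up with different columns, whereas a fixed lookup table keeps the column set stable.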

In [None]:
def prfs(y_true, y_pred):
    return precision_recall_fscore_support(y_true, y_pred)

In [None]:
def f1(y_true, y_pred):
    return f1_score(y_true, y_pred)

In [None]:
def bal_acc(y_true, y_pred):
    return balanced_accuracy_score(y_true, y_pred, adjusted=True)
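
A short worked example of what `adjusted=True` changes, using made-up labels: the adjusted score rescales balanced accuracy so that chance-level performance scores 0 and perfect performance scores 1.

```python
from sklearn.metrics import balanced_accuracy_score

# hypothetical labels: four class 0 samples and two class 1 samples
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

# recall for class 0 is 3/4 and for class 1 is 1/2, so the mean recall is 0.625
raw = balanced_accuracy_score(y_true, y_pred)
# adjusted for two classes: (0.625 - 0.5) / (1 - 0.5)
adj = balanced_accuracy_score(y_true, y_pred, adjusted=True)

# always predicting the majority class gives mean recall 0.5, which adjusts to 0
majority = balanced_accuracy_score(y_true, [0] * 6, adjusted=True)

print(raw, adj, majority)
```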

In [None]:
def evaluate_cluster_model(features, target, model, k_splits=10):
    cv = RepeatedStratifiedKFold(
        n_splits=k_splits, n_repeats=2, random_state=1)
    # evaluation scoring metric
    metric = make_scorer(f1)
    scores = cross_val_score(model, features, target,
                             scoring=metric, cv=cv, n_jobs=-1)
    return scores
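
A usage sketch of the same evaluation scheme on synthetic data. The dataset and the decision tree are stand-ins chosen for the demo, not the cluster features or models used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# imbalanced synthetic data (roughly 70% class 0, 30% class 1)
X, y = make_classification(n_samples=200, n_features=5,
                           weights=[0.7, 0.3], random_state=1)

# 10 stratified folds repeated twice gives 20 scores
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=1)
scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y,
                         scoring=make_scorer(f1_score), cv=cv)

print(len(scores), scores.mean(), scores.std())
```

Stratified folds keep the class ratio roughly constant in every split, which matters for F1 on imbalanced targets, and the repeats reduce the variance of the mean score.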

In [None]:
def load_cluster1_features():
    train_features = pd.read_csv('cluster1_train_features.csv')
    test_features = pd.read_csv('cluster1_test_features.csv')
    return train_features, test_features

In [None]:
def load_cluster1_targets():
    train_target = pd.read_csv('cluster1_train_target.csv')
    test_target = pd.read_csv('cluster1_test_target.csv')
    return train_target, test_target

In [None]:
def load_cluster3_targets():
    train_target = pd.read_csv('cluster3_train_target.csv')
    test_target = pd.read_csv('cluster3_test_target.csv')
    return train_target, test_target

In [None]:
def load_cluster3_features():
    train_features = pd.read_csv('cluster3_train_features.csv')
    test_features = pd.read_csv('cluster3_test_features.csv')
    return train_features, test_features

In [None]:
def expected_profit(cm_matrix, cb_matrix, pos_prior):
    """ calculates expected profit """
    neg_prior = 1 - pos_prior
    #total = cm_matrix[0][0] + cm_matrix[0][1] + cm_matrix[1][0] + cm_matrix[1][1]
    total_pos = cm_matrix[1][0] + cm_matrix[1][1]
    total_neg = cm_matrix[0][0] + cm_matrix[0][1]
    # confusion matrix rates
    tn_rate = cm_matrix[0][0] / total_neg
    fp_rate = cm_matrix[0][1] / total_neg
    fn_rate = cm_matrix[1][0] / total_pos
    tp_rate = cm_matrix[1][1] / total_pos

    exp_profit = (pos_prior * (tp_rate * cb_matrix[1][1] + fn_rate * cb_matrix[1][0])) + (
        neg_prior * (tn_rate * cb_matrix[0][0] + fp_rate * cb_matrix[0][1]))
    return np.round(exp_profit, 2)
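
A worked example of the calculation with made-up numbers. The function is repeated in compact form so the block runs on its own; the confusion matrix, cost-benefit values and prior are hypothetical, not results from this project.

```python
import numpy as np


def expected_profit(cm_matrix, cb_matrix, pos_prior):
    """Same calculation as above, repeated so this block is self-contained."""
    neg_prior = 1 - pos_prior
    total_pos = cm_matrix[1][0] + cm_matrix[1][1]
    total_neg = cm_matrix[0][0] + cm_matrix[0][1]
    tn_rate = cm_matrix[0][0] / total_neg
    fp_rate = cm_matrix[0][1] / total_neg
    fn_rate = cm_matrix[1][0] / total_pos
    tp_rate = cm_matrix[1][1] / total_pos
    exp_profit = (pos_prior * (tp_rate * cb_matrix[1][1] + fn_rate * cb_matrix[1][0])) + (
        neg_prior * (tn_rate * cb_matrix[0][0] + fp_rate * cb_matrix[0][1]))
    return np.round(exp_profit, 2)


# hypothetical confusion matrix, rows = actual class, columns = predicted class
cm = [[50, 10],   # 50 true negatives, 10 false positives
      [20, 20]]   # 20 false negatives, 20 true positives

# hypothetical cost-benefit per outcome: a winning bet returns +2 units,
# a losing bet costs 1 unit, and no bet (predicted 0) returns 0
cb = [[0, -1],
      [0, 2]]

# 0.4 * (0.5 * 2 + 0.5 * 0) + 0.6 * (5/6 * 0 + 1/6 * -1) = 0.4 - 0.1 = 0.3
profit = expected_profit(cm, cb, pos_prior=0.4)
print(profit)
```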

In [None]:
def home_lr_setup():
    # instantiate model
    lr_model = LogisticRegression(
        penalty='l1',
        solver='liblinear',
        C=0.13916392773859043,
        max_iter=130,
        tol=1.0683985965344678,
        class_weight=None,
        random_state=42
    )
    # load train features, keep model features
    home_train_features = load_home_train_features_with_drop_transformed()
    lr_feats = home_train_features[['AHTGS5PG_UPoutlier', 'AHTGS5PHG_UPoutlier', 'AATGS5PAG_UPoutlier', 'AHTSOT5PG_UPoutlier',
                                    'AATSOT5PG_UPoutlier', 'AHTSOT5PHG_UPoutlier', 'AHTGD5PHG_UPoutlier', 'HT_PrevSeasonPos_inv',
                                    'AT_PrevSeasonPos_inv', 'HA_AHTGS5PG_diff_lowqrt', 'AHTGS5PG_upqrt_AATGC5PG_lowqrt',
                                    'AHTSOT5PG_bxcx_pwrTRANSFORM', 'AATSOT5PG_bxcx_pwrTRANSFORM', 'AHTGD5PG_quantileTRANSFORM',
                                    'AATGD5PG_quantileTRANSFORM', 'AwayCapacityDiff_bin_quantileTRANSFORM',
                                    'bxcx_AHTSOT5PG_quantileTRANSFORM', 'bxcx_AATSOT5PG_quantileTRANSFORM',
                                    'bxcx_AAT_GS_P5PG_ratio_quantileTRANSFORM']]
    # load test features, keep model features
    test_features = load_test_features()
    lr_test_feats = test_features[['AHTGS5PG_UPoutlier', 'AHTGS5PHG_UPoutlier', 'AATGS5PAG_UPoutlier', 'AHTSOT5PG_UPoutlier',
                                   'AATSOT5PG_UPoutlier', 'AHTSOT5PHG_UPoutlier', 'AHTGD5PHG_UPoutlier', 'HT_PrevSeasonPos_inv',
                                   'AT_PrevSeasonPos_inv', 'HA_AHTGS5PG_diff_lowqrt', 'AHTGS5PG_upqrt_AATGC5PG_lowqrt',
                                   'AHTSOT5PG_bxcx_pwrTRANSFORM', 'AATSOT5PG_bxcx_pwrTRANSFORM', 'AHTGD5PG_quantileTRANSFORM',
                                   'AATGD5PG_quantileTRANSFORM', 'AwayCapacityDiff_bin_quantileTRANSFORM',
                                   'bxcx_AHTSOT5PG_quantileTRANSFORM', 'bxcx_AATSOT5PG_quantileTRANSFORM',
                                   'bxcx_AAT_GS_P5PG_ratio_quantileTRANSFORM']]
    # instantiate pca and near miss undersampling
    lr_pca = PCA(n_components=12, whiten=True)
    lr_nm3 = NearMiss(version=3, n_neighbors=3, n_neighbors_ver3=3, n_jobs=-1)
    lr_name = 'LR'
    # create pipeline
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=1)
    cal_lr_pipe = IMBPipeline(
        steps=[
            ('transformer', KeepColumnsTransformer(lr_feats.columns)),
            ('scaler', RobustScaler(with_centering=False)),
            ('sampling', lr_nm3),
            ('pca', lr_pca),
            ('model', CalibratedClassifierCV(lr_model, cv=cv, method='sigmoid'))
        ]
    )

    lr_pipe = IMBPipeline(
        steps=[
            ('transformer', KeepColumnsTransformer(lr_feats.columns)),
            ('scaler', RobustScaler(with_centering=False)),
            ('sampling', lr_nm3),
            ('pca', lr_pca),
            ('model', lr_model)
        ]
    )
    return lr_pipe, cal_lr_pipe, lr_test_feats, lr_name

In [None]:
def home_xgb_setup():
    # instantiate model
    xgb_model = XGBClassifier(n_estimators=320,
                              learning_rate=0.0488938809129075,
                              max_depth=3,
                              min_child_weight=8,
                              gamma=0.6936419289666603,
                              scale_pos_weight=1,
                              colsample_bytree=0.5920903811620744,
                              subsample=0.8374088758397853,
                              reg_alpha=0.07425990654719038,
                              reg_lambda=0.16768800738594583,
                              use_label_encoder=False,
                              n_jobs=-1,
                              verbosity=0,
                              random_state=42)
    # load train_features & keep model features
    home_train_features = load_home_train_features_with_drop()
    xgb_feats = home_train_features[['AHTSOT5PG', 'AATSOT5PG', 'AATGD5PAG', 'season_month_sin', 'season_month_cos',
                                     'HA_ATP5PG_diff', 'AwayCapacityDiff_bin', 'cluster_3', 'HT_PrevSeasonPos_inv',
                                     'AT_PrevSeasonPos_inv', 'HT_GSGC_UPLOW_QRT', 'AHTGC5PG_upqrt_AATGC5PG_lowqrt']]
    # predict cluster 3 for test features
    test_cl_feats = load_cluster_test_feats()
    gb_cl3_model = joblib.load('cl3_gb_model.sav')
    gb_pred = gb_cl3_model.predict(test_cl_feats)
    # load test features, add cluster 3 prediction, keep model features
    test_features = load_test_features()
    test_features['cluster_3'] = gb_pred
    xgb_test_feats = test_features[['AHTSOT5PG', 'AATSOT5PG', 'AATGD5PAG', 'season_month_sin', 'season_month_cos',
                                    'HA_ATP5PG_diff', 'AwayCapacityDiff_bin', 'cluster_3', 'HT_PrevSeasonPos_inv',
                                    'AT_PrevSeasonPos_inv', 'HT_GSGC_UPLOW_QRT', 'AHTGC5PG_upqrt_AATGC5PG_lowqrt']]
    # instantiate oversampling method
    xgb_bsmote = BorderlineSMOTE(
        k_neighbors=12, m_neighbors=20, kind='borderline-1', random_state=1, n_jobs=-1)
    xgb_name = 'XGB'
    # create pipeline
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=1)
    xgb_pipe = IMBPipeline(
        steps=[
            ('trans', KeepColumnsTransformer(xgb_feats.columns)),
            ('scaler', StandardScaler(with_mean=False, with_std=True)),
            ('os', xgb_bsmote),
            ('model', xgb_model)
        ]
    )

    cal_xgb_pipe = IMBPipeline(
        steps=[
            ('trans', KeepColumnsTransformer(xgb_feats.columns)),
            ('scaler', StandardScaler(with_mean=False, with_std=True)),
            ('os', xgb_bsmote),
            ('model', CalibratedClassifierCV(xgb_model, cv=cv, method='sigmoid'))
        ]
    )
    return xgb_pipe, cal_xgb_pipe, xgb_test_feats, xgb_name

In [None]:
def away_gb_setup():
    # instantiate model
    gb_model = GradientBoostingClassifier(n_estimators=650,
                                          learning_rate=0.027115584500934167,
                                          max_depth=3,
                                          max_features=8,
                                          min_samples_leaf=420,
                                          min_samples_split=675,
                                          subsample=0.43918481213848615,
                                          #ccp_alpha =  1.1243946224171244,
                                          random_state=42)
    # load train features & keep model features
    away_train_features = load_away_train_features_with_drop()
    gb_feats = away_train_features[['AHTGC5PHG', 'AATSOT5PG', 'AHTSOT5PHG', 'HA_AHTGS5PG_diff', 'AHT_GS_P5PG_ratio',
                                    'AwayCapacityDiff_bin', 'cluster_1', 'cluster_3', 'HT_PrevSeasonPos_inv',
                                    'AT_PrevSeasonPos_inv', 'bxcx_AHTGC5PG', 'bxcx_AATSOT5PG', 'bxcx_AHT_GS_P5PG_ratio']]
    # get predictions for cluster 1 & 3 features
    test_cluster_feats = load_cluster_test_feats()
    rf_cl1_model = joblib.load('cl1_rf_model.sav')
    rf_pred = rf_cl1_model.predict(test_cluster_feats)
    gb_cl3_model = joblib.load('cl3_gb_model.sav')
    gb_pred = gb_cl3_model.predict(test_cluster_feats)
    # load test features, add cluster 1 & 3 predictions, keep model features
    test_features = load_test_features()
    test_features['cluster_1'] = rf_pred
    test_features['cluster_3'] = gb_pred
    gb_test_feats = test_features[['AHTGC5PHG', 'AATSOT5PG', 'AHTSOT5PHG', 'HA_AHTGS5PG_diff', 'AHT_GS_P5PG_ratio',
                                   'AwayCapacityDiff_bin', 'cluster_1', 'cluster_3', 'HT_PrevSeasonPos_inv',
                                   'AT_PrevSeasonPos_inv', 'bxcx_AHTGC5PG', 'bxcx_AATSOT5PG', 'bxcx_AHT_GS_P5PG_ratio']]
    # instantiate oversampling method
    gb_adasyn = ADASYN(n_neighbors=6, random_state=1, n_jobs=-1)
    gb_name = 'GB'
    # create pipeline
    gb_pipe = IMBPipeline(
        steps=[
            ('trans', KeepColumnsTransformer(gb_feats.columns)),
            ('scaler', StandardScaler(with_mean=False, with_std=True)),
            ('os', gb_adasyn),
            ('model', gb_model)
        ]
    )
    return gb_pipe, gb_test_feats, gb_name

In [None]:
def away_xgb_setup():
    # instantiate model
    xgb_model = XGBClassifier(n_estimators=570,
                              learning_rate=0.018277402480679054,
                              max_depth=3,
                              min_child_weight=10,
                              gamma=0.6757959901913699,
                              subsample=0.7962485851393263,
                              colsample_bytree=0.5735095367483903,
                              scale_pos_weight=1,
                              reg_alpha=0.13029677939259385,
                              reg_lambda=0.10315636718690373,
                              use_label_encoder=False,
                              n_jobs=-1,
                              verbosity=0,
                              random_state=42)
    # load train features, keep model features
    away_train_features = load_away_train_features_with_drop()
    xgb_feats = away_train_features[['AHTGC5PG', 'AATP5PAG', 'DayofWeek', 'HA_ATP5PG_diff', 'cluster_1', 'HT_PrevSeasonPos_inv',
                                     'AT_PrevSeasonPos_inv', 'HA_AHTGS5PG_diff_upqrt', 'AHTGC5PG_upqrt_AATGC5PG_lowqrt']]
    # load test cluster features, get predictions for cluster 1 feature
    test_cluster_feats = load_cluster_test_feats()
    rf_cl1_model = joblib.load('cl1_rf_model.sav')
    rf_pred = rf_cl1_model.predict(test_cluster_feats)
    # load test features, add cluster 1 predictions, keep model features
    test_features = load_test_features()
    test_features['cluster_1'] = rf_pred
    xgb_test_feats = test_features[['AHTGC5PG', 'AATP5PAG', 'DayofWeek', 'HA_ATP5PG_diff', 'cluster_1', 'HT_PrevSeasonPos_inv',
                                    'AT_PrevSeasonPos_inv', 'HA_AHTGS5PG_diff_upqrt', 'AHTGC5PG_upqrt_AATGC5PG_lowqrt']]
    # instantiate oversampling method
    xgb_svms = SVMSMOTE(k_neighbors=8, m_neighbors=14,
                        svm_estimator=SVC(), random_state=1, n_jobs=-1)
    xgb_name = 'XGB'
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=1)
    # create pipeline
    xgb_pipe = IMBPipeline(
        steps=[
            ('trans', KeepColumnsTransformer(xgb_feats.columns)),
            ('scaler', StandardScaler(with_mean=False, with_std=True)),
            ('os', xgb_svms),
            ('model', xgb_model)
        ]
    )

    cal_xgb_pipe = IMBPipeline(
        steps=[
            ('trans', KeepColumnsTransformer(xgb_feats.columns)),
            ('scaler', StandardScaler(with_mean=False, with_std=True)),
            ('os', xgb_svms),
            ('model', CalibratedClassifierCV(xgb_model, cv=cv, method='sigmoid'))
        ]
    )
    return xgb_pipe, cal_xgb_pipe, xgb_test_feats, xgb_name

In [None]:
def away_lgbm_setup():
    lgbm_model = LGBMClassifier(n_estimators=470,
                                learning_rate=0.04250057994580665,
                                objective='binary',
                                max_depth=4,
                                num_leaves=35,
                                min_child_samples=1320,
                                colsample_bytree=0.7787949835735254,
                                subsample=0.8073575692094029,
                                reg_alpha=0.06951672237498768,
                                reg_lambda=0.03784851967467663,
                                is_unbalance=True,
                                n_jobs=-1,
                                random_state=42)
    away_train_features = load_away_train_features_with_drop()
    lgbm_feats = away_train_features[['AHTGS5PG', 'AHTGC5PG', 'AHTSOT5PHG', 'AATSOT5PAG', 'AHTP5PHG', 'AHTGD5PG',
                                      'HA_AHTGS5PG_diff', 'HA_ATP5PG_diff', 'AHT_GS_P5PG_ratio', 'AwayCapacityDiff_bin',
                                      'HT_PrevSeasonPos_inv', 'AT_PrevSeasonPos_inv']]
    test_features = load_test_features()
    lgbm_test_feats = test_features[['AHTGS5PG', 'AHTGC5PG', 'AHTSOT5PHG', 'AATSOT5PAG', 'AHTP5PHG', 'AHTGD5PG',
                                     'HA_AHTGS5PG_diff', 'HA_ATP5PG_diff', 'AHT_GS_P5PG_ratio', 'AwayCapacityDiff_bin',
                                     'HT_PrevSeasonPos_inv', 'AT_PrevSeasonPos_inv']]
    lgbm_svms = SVMSMOTE(k_neighbors=3, m_neighbors=7,
                         random_state=1, n_jobs=-1)
    lgbm_name = 'LGBM'
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=1)
    lgbm_pipe = IMBPipeline(
        steps=[
            ('trans', KeepColumnsTransformer(lgbm_feats.columns)),
            ('scaler', StandardScaler(with_mean=True, with_std=True)),
            ('os', lgbm_svms),
            ('model', lgbm_model)
        ]
    )

    cal_lgbm_pipe = IMBPipeline(
        steps=[
            ('trans', KeepColumnsTransformer(lgbm_feats.columns)),
            ('scaler', StandardScaler(with_mean=True, with_std=True)),
            ('os', lgbm_svms),
            ('model', CalibratedClassifierCV(lgbm_model, cv=cv, method='sigmoid'))
        ]
    )
    return lgbm_pipe, cal_lgbm_pipe, lgbm_test_feats, lgbm_name

In [None]:
class KeepColumnsTransformer():
    """ 
    custom transformer that keeps specified columns of dataframe
    """

    def __init__(self, columns=None):
        self.columns = columns

    def transform(self, X, **transform_params):
        if len(self.columns) == 1:
            drop_df = np.array(X[self.columns[0]].copy()).reshape(-1, 1)
        else:
            drop_df = X[self.columns].copy()
        return drop_df

    def fit(self, X, y=None, **fit_params):
        return self

# -- Data Cleansing --

In [None]:
# import data
data = pd.read_csv("new_combined_leagues_avg_stats")

In [None]:
# remove bookmaker odds: models might rely on these columns too much,
# and the aim is to beat the bookmakers without them
# remove referee: high number of nan values
data.drop(columns=['Referee', 'LBH', 'LBD', 'LBA',
                   'WHH', 'WHD', 'WHA'], inplace=True)

In [None]:
# remove league 2 data
data = data[data['Div'] != 'E3']

## Nan Values

In [None]:
# check missing data
null_data = data[data.isnull().any(axis=1)]
null_data

In [None]:
# get rows from null data that contain 2 or fewer nan values
# get rows from null data that contain more than 2 nan values
low_null_index = []
high_null_index = []
for row in range(len(null_data)):
    if null_data.iloc[row, :].isna().sum() <= 2:
        low_null_index.append(null_data.iloc[row, :].name)
    elif null_data.iloc[row, :].isna().sum() > 2:
        high_null_index.append(null_data.iloc[row, :].name)

In [None]:
# remove rows from data with high number of nan values
#data.drop(labels = 'index', axis = 1, inplace = True)
data.drop(labels=high_null_index, axis=0, inplace=True)
data.reset_index(drop=True, inplace=True)

In [None]:
# train/test split
train_features, test_features, train_target, test_target = train_test_split(
    data.iloc[:, :-1], data.iloc[:, -1], test_size=0.2, random_state=42, stratify=data['FTR'])
train_features.reset_index(drop=True, inplace=True)
train_target.reset_index(drop=True, inplace=True)
test_features.reset_index(drop=True, inplace=True)
test_target.reset_index(drop=True, inplace=True)
# train_features.head()

In [None]:
# get train features with nan values
nan_cols = []
for col in train_features:
    if train_features[col].isnull().values.any():
        nan_cols.append(col)

print(nan_cols)

In [None]:
# load train features for imputation re run
train_features = pd.read_csv('combined_leagues_train features.csv')

In [None]:
# similarity impute train data
num_cols = ['FTHG', 'AHTGS5PG', 'FTAG', 'AATGS5PG', 'AHTGC5PG', 'AATGC5PG', 'AHTGS5PHG',
            'AATGS5PAG', 'AHTGC5PHG', 'AATGC5PAG', 'HS', 'AS', 'HST', 'AST', 'AHTSOT5PG',
            'AATSOT5PG', 'AHTSOT5PHG', 'AATSOT5PAG', 'HF', 'AF', 'HC', 'AC', 'HY', 'AY',
            'HR', 'AR', 'AHTP5PG', 'AATP5PG', 'AHTP5PHG', 'AATP5PAG']

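# run similarity imputation on individual columns
# note: train_features_imputed must already exist before the first call
# (e.g. as a copy of train_features); it is updated in place on each re-run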
train_features_imputed, nan_values = similarity_imputation(
    train_features_imputed, 'AATSOT5PAG', num_cols, missing_value='nan', dec_places=1, n_neighbours=7)

In [None]:
# impute missing HS for QPR with the team's mean HS for that season (Aug 2002 to Apr 2003, E2)  # 1359
mask_df = train_features[train_features['HomeTeam'].str.contains('QPR') & train_features['Date'].str.contains('/02') & train_features['Div'].str.contains(
    'E2') | train_features['HomeTeam'].str.contains('QPR') & train_features['Date'].str.contains('/03') & train_features['Div'].str.contains('E2')]
mask_idx = []
for i in range(len(mask_df)):
    yr = int(mask_df.iloc[i, :]['Date'].split('/')[2])
    mnth = int(mask_df.iloc[i, :]['Date'].split('/')[1])
    if yr == int('02'):
        if mnth >= int('08'):
            mask_idx.append(i)
    elif yr == int('03'):
        if mnth <= int('04'):
            mask_idx.append(i)
mean = mask_df.iloc[mask_idx, :]['HS'].mean()
mask = train_features[train_features['HS'].isnull()]['HomeTeam'] == 'QPR'
mask_idx = mask[mask].index.values
train_features.loc[mask_idx, 'HS'] = mean

In [None]:
# remove nans from test data
test_nan_idx = test_features[test_features.isnull().any(axis=1)].index
test_features.drop(labels=test_nan_idx, axis=0, inplace=True)
test_target.drop(labels=test_nan_idx, inplace=True)
test_features.reset_index(drop=True, inplace=True)
test_target.reset_index(drop=True, inplace=True)

In [None]:
# remove any duplicates
train_features.drop_duplicates(inplace=True)

In [None]:
train_features.describe()

In [None]:
train_features.describe(include=['O'])

## Outliers

In [None]:
# continuous outlier check
train_features = load_train_features()
outlier_dict = {}
for col in train_features:
    if train_features[col].dtype == 'int64' or train_features[col].dtype == 'float64':
        stats = train_features[col].describe()
        IQR = stats['75%'] - stats['25%']
        upper = stats['75%'] + 1.5 * IQR
        lower = stats['25%'] - 1.5 * IQR
        lower_bound_outliers = train_features[train_features[col] < lower]
        upper_bound_outliers = train_features[train_features[col] > upper]
        # update dictionary
        outlier_dict[col] = (upper, lower)
        print('{}: {} upper outliers and {} lower outliers, bounds: upper bound = {}, lower bound = {}\n'.format(
            col, len(upper_bound_outliers), len(lower_bound_outliers), upper, lower))

In [None]:
# save outlier dict for future use
with open('outlier_dict.pkl', 'wb') as dict_file:
    pickle.dump(outlier_dict, dict_file)

In [None]:
# continuous histogram plots
for col in train_features:
    if train_features[col].dtype == 'int64' or train_features[col].dtype == 'float64':
        train_features[col].hist(bins=100)
        plt.title(f'{col} hist plot')
        plt.xlabel(f'{col}')
        plt.ylabel('Frequency')
        plt.grid(alpha=0.3)
        plt.show()

No observational outliers were found. Outliers within features can be put down to the rare results that football throws up from time to time rather than erroneously recorded observations.

## Target Engineering

In [None]:
# binarise target for home team win... H = 1, D & A = 0
home_win_train_target = train_target.copy()
home_win_test_target = test_target.copy()
# train target
h_idx = train_target[train_target == 'H'].index
a_idx = train_target[train_target == 'A'].index
d_idx = train_target[train_target == 'D'].index
home_win_train_target[h_idx] = 1
home_win_train_target[a_idx] = 0
home_win_train_target[d_idx] = 0
# test target
h_idx = test_target[test_target == 'H'].index
a_idx = test_target[test_target == 'A'].index
d_idx = test_target[test_target == 'D'].index
home_win_test_target[h_idx] = 1
home_win_test_target[a_idx] = 0
home_win_test_target[d_idx] = 0

In [None]:
# binarise target for away team win... A = 1, D & H = 0
away_win_train_target = train_target.copy()
away_win_test_target = test_target.copy()
# train target
a_idx = train_target[train_target == 'A'].index
h_idx = train_target[train_target == 'H'].index
d_idx = train_target[train_target == 'D'].index
away_win_train_target[a_idx] = 1
away_win_train_target[h_idx] = 0
away_win_train_target[d_idx] = 0
# test target
a_idx = test_target[test_target == 'A'].index
h_idx = test_target[test_target == 'H'].index
d_idx = test_target[test_target == 'D'].index
away_win_test_target[a_idx] = 1
away_win_test_target[h_idx] = 0
away_win_test_target[d_idx] = 0

## - Summary -

The 'Referee' feature was removed because it has a high number of nan values. It also contains a high number of unique values and is unlikely to provide much information about the target.

Bookmaker odds were also removed so that they would not influence the models, as the general idea is to use team and match variables to identify whether the home or away team is more likely to win. Ultimately the aim is to use team and match information to gain leverage over the bookmakers. This decision should be kept under review, however, as bookmaker odds could provide latent information about the teams and matches being played, and including them would provide a good comparison for future analysis.

A chunk of nan values was removed from the head of the data. These nan values were a result of averaging features over the 5 previous games: where there are no previous games to average over, the inputs are nan. Rows containing more than 2 nan values were also removed. For rows containing 2 or fewer nan values, similarity imputation was applied using Euclidean distance, and per-team feature averages were imputed. For the test data, rows containing nan values were removed.
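
As a rough illustration of the similarity imputation idea, here is a minimal sketch. It is not the `similarity_imputation` function used in this notebook (that is defined elsewhere); the function name `knn_mean_impute` and the toy data are hypothetical. It assumes a unique index and fills each missing value with the rounded mean of the target column over the nearest complete rows, measured by Euclidean distance on the other numeric columns.

```python
import numpy as np
import pandas as pd


def knn_mean_impute(df, target_col, feature_cols, n_neighbours=7, dec_places=1):
    """Hypothetical sketch: fill NaNs in target_col from the nearest complete rows."""
    out = df.copy()
    # columns used to measure similarity (everything except the target)
    sim_cols = [c for c in feature_cols if c != target_col]
    missing = out[target_col].isna()
    # donors: rows where the target and all similarity columns are present
    donors = out[~missing].dropna(subset=sim_cols)
    donor_x = donors[sim_cols].to_numpy(dtype=float)
    donor_y = donors[target_col].to_numpy(dtype=float)
    for idx in out.index[missing]:
        row = out.loc[idx, sim_cols].to_numpy(dtype=float)
        if np.isnan(row).any():
            continue  # distance cannot be measured, leave the value as NaN
        # euclidean distance from this row to every donor row
        dist = np.sqrt(((donor_x - row) ** 2).sum(axis=1))
        nearest = np.argsort(dist)[:n_neighbours]
        out.loc[idx, target_col] = round(donor_y[nearest].mean(), dec_places)
    return out


# toy example: row 2 is missing 'b'; its two nearest rows by 'a' are rows 1 and 0
demo = pd.DataFrame({'a': [1.0, 2.0, 3.0, 10.0],
                     'b': [1.0, 2.0, np.nan, 10.0]})
filled = knn_mean_impute(demo, 'b', ['a', 'b'], n_neighbours=2)
print(filled)
```

In practice the similarity columns would probably need scaling first, otherwise features with large ranges dominate the distance.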

Description of data shows nothing unexpected.

No observational outliers were found, but by the interquartile range rule outliers are present in some features and will need to be addressed.
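
One possible way of addressing the IQR outliers is to flag them with indicator features rather than remove them. The sketch below assumes this approach; the helper `iqr_bounds`, the toy goals series and the `_UPoutlier` / `_LOWoutlier` column suffixes are illustrative only and are not taken from the feature engineering code.

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return (upper, lower) bounds using the k * IQR rule."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q3 + k * iqr, q1 - k * iqr


# toy full time home goals, with one unusually high score
goals = pd.Series([0, 1, 1, 2, 2, 2, 3, 9], name='FTHG')
upper, lower = iqr_bounds(goals)

# binary indicator columns (hypothetical names) marking values outside the bounds
flags = pd.DataFrame({
    'FTHG_UPoutlier': (goals > upper).astype(int),
    'FTHG_LOWoutlier': (goals < lower).astype(int),
})
print(upper, lower)
print(flags.sum())
```

Flagging keeps the rare results in the training data while still letting a model treat them differently.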

Home win and away win targets have been engineered from the full time result with 1 representing win and 0 representing a draw or loss.
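
The same binarisation can be written more compactly with a boolean comparison. This is a minimal sketch on a toy full time result series (the values are made up), not a replacement for the cells above. Note that it produces integer targets, whereas assigning 1 and 0 into a copy of the string series leaves an object dtype.

```python
import pandas as pd

# toy full time results: H = home win, D = draw, A = away win
ftr = pd.Series(['H', 'D', 'A', 'H', 'A'], name='FTR')

# 1 where the home team won, 0 for a draw or away win
home_win = (ftr == 'H').astype(int)
# 1 where the away team won, 0 for a draw or home win
away_win = (ftr == 'A').astype(int)

print(home_win.tolist())
print(away_win.tolist())
```

A draw is class 0 in both targets, so the two targets never sum to more than 1 for any match.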

# -- EDA --

In [None]:
home_df = load_home_win_train_df()
away_df = load_away_win_train_df()

print('All Leagues:')
ag_hw = home_df['HomeFTR'].value_counts()[1]
ag_hnw = home_df['HomeFTR'].value_counts()[0]
ag_aw = away_df['AwayFTR'].value_counts()[1]
ag_anw = away_df['AwayFTR'].value_counts()[0]
print(f'    Home Win Count:\n        Win: {ag_hw}\n       No Win: {ag_hnw}\n')
print(f'    Away Win Count:\n        Win: {ag_aw}\n       No Win: {ag_anw}\n')
ag_tot_games = ag_hw + ag_hnw
ag_hw_perc = ag_hw / ag_tot_games
ag_aw_perc = ag_aw / ag_tot_games
print(
    f'  All Leagues Since 00/01..  Home Win Percentage: {np.round(ag_hw_perc,2)}        Away Win Percentage: {np.round(ag_aw_perc,2)}\n')

print('Premier League:')
prem_hw = len(home_df[(home_df['Div'] == 2) & (home_df['HomeFTR'] == 1)])
prem_hnw = len(home_df[(home_df['Div'] == 2) & (home_df['HomeFTR'] == 0)])
prem_aw = len(away_df[(away_df['Div'] == 2) & (away_df['AwayFTR'] == 1)])
prem_anw = len(away_df[(away_df['Div'] == 2) & (away_df['AwayFTR'] == 0)])
print(
    f'    Prem Home Win Count:\n        Win: {prem_hw}\n       No Win: {prem_hnw}\n')
print(
    f'    Prem Away Win Count:\n        Win: {prem_aw}\n       No Win: {prem_anw}\n')
prem_tot_games = prem_hw + prem_hnw
prem_hw_perc = prem_hw / prem_tot_games
prem_aw_perc = prem_aw / prem_tot_games
print(
    f'  Premier League Since 00/01..  Home Win Percentage: {np.round(prem_hw_perc,2)}        Away Win Percentage: {np.round(prem_aw_perc,2)}\n')

print('Championship:')
champ_hw = len(home_df[(home_df['Div'] == 1) & (home_df['HomeFTR'] == 1)])
champ_hnw = len(home_df[(home_df['Div'] == 1) & (home_df['HomeFTR'] == 0)])
champ_aw = len(away_df[(away_df['Div'] == 1) & (away_df['AwayFTR'] == 1)])
champ_anw = len(away_df[(away_df['Div'] == 1) & (away_df['AwayFTR'] == 0)])
print(
    f'    Champ Home Win Count:\n        Win: {champ_hw}\n       No Win: {champ_hnw}\n')
print(
    f'    Champ Away Win Count:\n        Win: {champ_aw}\n       No Win: {champ_anw}\n')
champ_tot_games = champ_hw + champ_hnw
champ_hw_perc = champ_hw / champ_tot_games
champ_aw_perc = champ_aw / champ_tot_games
print(
    f'  Championship Since 00/01..  Home Win Percentage: {np.round(champ_hw_perc,2)}        Away Win Percentage: {np.round(champ_aw_perc,2)}\n')

print('League One:')
l1_hw = len(home_df[(home_df['Div'] == 0) & (home_df['HomeFTR'] == 1)])
l1_hnw = len(home_df[(home_df['Div'] == 0) & (home_df['HomeFTR'] == 0)])
l1_aw = len(away_df[(away_df['Div'] == 0) & (away_df['AwayFTR'] == 1)])
l1_anw = len(away_df[(away_df['Div'] == 0) & (away_df['AwayFTR'] == 0)])
print(
    f'    League 1 Home Win Count:\n        Win: {l1_hw}\n       No Win: {l1_hnw}\n')
print(
    f'    League 1 Away Win Count:\n        Win: {l1_aw}\n       No Win: {l1_anw}\n')
l1_tot_games = l1_hw + l1_hnw
l1_hw_perc = l1_hw / l1_tot_games
l1_aw_perc = l1_aw / l1_tot_games
print(
    f'  League One Since 00/01..  Home Win Percentage: {np.round(l1_hw_perc,2)}        Away Win Percentage: {np.round(l1_aw_perc,2)}\n')

Home win rates and away win rates are very similar across the leagues, with home win rates of approx. 44% and away win rates of approx. 29%.
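
The per-league counts and rates can also be produced in one step with a groupby. This is a sketch on made-up data, assuming the same encoding as above (`Div` 2 = Premier League, 1 = Championship, 0 = League One, and `HomeFTR` 1 = home win). The output column names are arbitrary.

```python
import pandas as pd

# toy data: one row per match
toy = pd.DataFrame({
    'Div':     [2, 2, 1, 1, 0, 0, 0, 0],
    'HomeFTR': [1, 0, 1, 1, 0, 0, 1, 0],
})

# wins = sum of the 0/1 target, games = row count, win_rate = mean of the target
rates = toy.groupby('Div')['HomeFTR'].agg(['sum', 'count', 'mean'])
rates.columns = ['wins', 'games', 'win_rate']
print(rates)
```

The same call on `away_df` with `AwayFTR` would give the away win rates.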

In [None]:
home_train_target, _ = load_home_targets()
away_train_target, _ = load_away_targets()
htg = home_train_target.value_counts().sum()
hw = home_train_target.value_counts()[1]
hnw = home_train_target.value_counts()[0]
hw_prop = hw/htg
atg = away_train_target.value_counts().sum()
aw = away_train_target.value_counts()[1]
anw = away_train_target.value_counts()[0]
aw_prop = aw/atg
num_samps = len(home_train_target)
#print(f'total samples: {num_samps},    home win prop: {hw_prop},    away win prop: {aw_prop}')
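# one-sample z-tests: `value` is the proportion under the null hypothesis, not the significance level
hstat, hpval = proportions_ztest(count=hw, nobs=num_samps, value=.05)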
astat, apval = proportions_ztest(count=aw, nobs=num_samps, value=.05)
bstat, bpval = proportions_ztest(count=np.array(
    [hw, aw]), nobs=np.array([num_samps, num_samps]))
if hpval <= 0.05:
    print(
        f'Home win proportion is statistically significant and not likely to have occurred by chance.\nProportion: {hw_prop}, Pvalue: {hpval}\n')
else:
    print(
        f'Home win proportion is not statistically significant and is likely to have occurred by chance.\nProportion: {hw_prop}, Pvalue: {hpval}\n')

if apval <= 0.05:
    print(
        f'Away win proportion is statistically significant and not likely to have occurred by chance.\nProportion: {aw_prop}, Pvalue: {apval}\n')
else:
    print(
        f'Away win proportion is not statistically significant and is likely to have occurred by chance.\nProportion: {aw_prop}, Pvalue: {apval}\n')
if bpval <= 0.05:
    print(
        f'Difference in proportions is statistically significant and not likely to have occurred by chance.\nPvalue: {bpval}\n')
else:
    print(
        f'Difference in proportions is not statistically significant and is likely to have occurred by chance.\nPvalue: {bpval}\n')

In [None]:
train_features = load_train_features()
train_features.info()

In [None]:
# variable type
train_features = load_train_features()

ord_feat = ['Div', 'HT_PrevSeasonPos_inv', 'AT_PrevSeasonPos_inv']
cyclical_feat = ['month', 'DayofWeek', 'season_month', 'season_month_sin',
                 'season_month_cos', 'DayofWeek_sin', 'DayofWeek_cos']
nominal_feat = ['HomeTeam', 'AwayTeam']
dich_feat = [col for col in train_features.columns if 'UPoutlier' in col or 'LOWoutlier' in col or 'Local_Derby' in col or 'Dist>=100' in col or 'cluster' in col or 'upqrt' in col or 'lowqrt' in col or 'UPLOW' in col or 'bigcapacitydiff_' in col]
cont_feat = [col for col in train_features.columns if col not in ord_feat +
             cyclical_feat + nominal_feat + dich_feat]

# combine cyclical feats for analysis
season_df = pd.DataFrame(data={'season_month_sin': train_features.season_month_sin,
                               'season_month_cos': train_features.season_month_cos})
day_df = pd.DataFrame(data={'DayofWeek_sin': train_features.DayofWeek_sin,
                            'DayofWeek_cos': train_features.DayofWeek_cos})

cyclical_dfs = [season_df, day_df]

In [None]:
# distribution plots for home win and away win
home_df = load_home_win_train_df()
away_df = load_away_win_train_df()

for col in home_df:
    if col not in nominal_feat:
        if col != 'HomeFTR':
            fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6)
                  ) = plt.subplots(3, 2, figsize=(12, 8))
            fig.suptitle(
                f'{col} Distribution & Correlation/Association Plots\n')
            fig.tight_layout(pad=4.0)
            # home distribution
            hc_tab = pd.crosstab(home_df[col], home_df['HomeFTR'])
            idx = hc_tab.index
            hbar1 = ax1.bar(idx, hc_tab[0], width=0.6)
            hbar2 = ax1.bar(idx, hc_tab[1], width=0.6)
            ax1.set_title(f'{col} Home win/no win Distribution Chart')
            ax1.set_xlabel(f'{col}')
            ax1.set_ylabel('Frequency')
            ax1.legend((hbar1[0], hbar2[0]), ('Home No Win', 'Home Win'))
            ax1.grid(alpha=0.3)

            # away distribution
            ac_tab = pd.crosstab(away_df[col], away_df['AwayFTR'])
            idx = ac_tab.index
            abar1 = ax2.bar(idx, ac_tab[0], width=0.6)
            abar2 = ax2.bar(idx, ac_tab[1], width=0.6)
            ax2.set_title(f'{col} Away win/no win Distribution Chart')
            ax2.set_xlabel(f'{col}')
            ax2.set_ylabel('Frequency')
            ax2.legend((abar1[0], abar2[0]), ('Away No Win', 'Away Win'))
            ax2.grid(alpha=0.3)

            if col in cont_feat or col in cyclical_feat:
                # home boxplot
                sns.boxplot(x=home_df['HomeFTR'], y=home_df[col], ax=ax3)
                ax3.set_title(f'{col} home target Correlation')
                ax3.set_xlabel('Home Full Time Result')
                ax3.set_ylabel(f'{col}')
                ax3.grid(alpha=0.3)

                # away boxplot
                sns.boxplot(x=away_df['AwayFTR'], y=away_df[col], ax=ax4)
                ax4.set_title(f'{col} away target Correlation')
                ax4.set_xlabel('Away Full Time Result')
                ax4.set_ylabel(f'{col}')
                ax4.grid(alpha=0.3)

            if col in cont_feat:
                # feature qqplot
                qqplot(home_df[col], line='s', ax=ax5)
                ax5.set_title(f'{col} Normality Plot')
                ax5.grid(alpha=0.3)
            plt.show()

In [None]:
# continuous feature power and quantile transforms
train_features = load_train_features()

yj_pt = PowerTransformer(method='yeo-johnson')
bc_pt = PowerTransformer(method='box-cox')
qt = QuantileTransformer(
    n_quantiles=1000, output_distribution='normal', random_state=1)

for col in train_features[cont_feat]:
    feat = train_features[col]
    fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(12, 8))
    fig.suptitle(f'{col} Normality Transforms\n')
    fig.tight_layout(pad=4.0)

    # original distribution
    ax1.bar(feat.value_counts().index, feat.value_counts().values)
    ax1.set_title(f'{col} Original Distribution')
    ax1.set_xlabel(f'{col}')
    ax1.set_ylabel('Frequency')
    ax1.grid(alpha=0.3)

    # yeo-johnson transform (works for data with or without negative values)
    feat_yj_trans = yj_pt.fit_transform(np.array(feat).reshape(-1, 1))
    vals, freq = np.unique(feat_yj_trans, return_counts=True)
    ax2.bar(vals, freq)
    ax2.set_title(f'{col} Yeo-Johnson Transform')
    ax2.set_xlabel(f'{col}')
    ax2.set_ylabel('Frequency')
    ax2.grid(alpha=0.3)

    # box-cox transform, only for data without negative values
    # (shifted by 1 as box-cox needs strictly positive input)
    if feat.min() >= 0:
        feat_bc_trans = bc_pt.fit_transform(np.array(feat + 1).reshape(-1, 1))
        vals, freq = np.unique(feat_bc_trans, return_counts=True)
        ax3.bar(vals, freq)
        ax3.set_title(f'{col} Box-Cox Transform')
        ax3.set_xlabel(f'{col}')
        ax3.set_ylabel('Frequency')
        ax3.grid(alpha=0.3)

    # quantile transformation
    feat_qt_trans = qt.fit_transform(np.array(feat).reshape(-1, 1))
    vals, freq = np.unique(feat_qt_trans, return_counts=True)
    ax4.bar(vals, freq)
    ax4.set_title(f'{col} Quantile Transform')
    ax4.set_xlabel(f'{col}')
    ax4.set_ylabel('Frequency')
    ax4.grid(alpha=0.3)

    plt.show()

In [None]:
# continuous pair plot for non transformed data with home win
home_df = load_home_win_train_df()
non_trans = [col for col in home_df.columns if 'TRANSFORM' not in col]
home_df = home_df[non_trans]
continuous = [
    col for col in home_df.columns if col in cont_feat or col == 'HomeFTR']
home_df = home_df[continuous]
drop_cols = ['FTHG', 'FTAG', 'HS', 'AS', 'HST', 'AST',
             'HF', 'AF', 'HC', 'AC', 'HY', 'AY', 'HR', 'AR']
home_df.drop(columns=drop_cols, inplace=True)

for i in home_df.columns:
    for j in home_df.columns:
        if i == j:
            continue
        elif i == 'HomeFTR':
            continue
        elif j == 'HomeFTR':
            continue
        else:
            sns.scatterplot(data=home_df, x=i, y=j, hue='HomeFTR')
            plt.show()

In [None]:
# plot distributions of nominal feats on separate, larger figures as they have far too many categories for the grid above
home_df = load_home_win_train_df()
away_df = load_away_win_train_df()

for col in home_df:
    if col in nominal_feat:
        fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(20, 16))
        fig.suptitle(f'{col} Distribution & Correlation/Association Plots\n')
        fig.tight_layout(pad=10.0)

        hc_tab = pd.crosstab(home_df[col], home_df['HomeFTR'])
        h_idx = hc_tab.index
        hbar1 = ax1.bar(h_idx, hc_tab[0], width=0.6)
        hbar2 = ax1.bar(h_idx, hc_tab[1], width=0.6)
        ax1.set_title(f'{col} Home win/no win Distribution Chart')
        ax1.set_xlabel(f'{col}')
        ax1.set_ylabel('Frequency')
        ax1.set_xticklabels(h_idx, rotation=90)
        ax1.legend((hbar1[0], hbar2[0]), ('Home No Win', 'Home Win'))
        ax1.grid(alpha=0.3)

        ac_tab = pd.crosstab(away_df[col], away_df['AwayFTR'])
        a_idx = ac_tab.index
        abar1 = ax2.bar(a_idx, ac_tab[0], width=0.6)
        abar2 = ax2.bar(a_idx, ac_tab[1], width=0.6)
        ax2.set_title(f'{col} Away win/no win Distribution Chart')
        ax2.set_xlabel(f'{col}')
        ax2.set_ylabel('Frequency')
        ax2.set_xticklabels(a_idx, rotation=90)
        ax2.legend((abar1[0], abar2[0]), ('Away No Win', 'Away Win'))
        ax2.grid(alpha=0.3)

        plt.show()

In [None]:
# plot proportions of win/no win for nominal features
home_df = load_home_win_train_df()
away_df = load_away_win_train_df()

for col in home_df:
    if col in nominal_feat:
        # home df proportions
        h_ct = pd.crosstab(home_df[col], home_df['HomeFTR'])
        h_idx = h_ct.index
        home_team_win_perc = []
        home_team_nowin_perc = []
        for h_team in h_idx:
            ht_wins = h_ct[h_ct.index == h_team][1].values
            ht_nowin = h_ct[h_ct.index == h_team][0].values
            ht_tot = ht_wins + ht_nowin
            hwin_perc = float(ht_wins / ht_tot)
            hnowin_perc = float(ht_nowin / ht_tot)
            home_team_win_perc.append(hwin_perc)
            home_team_nowin_perc.append(hnowin_perc)

        # away df proportions
        a_ct = pd.crosstab(away_df[col], away_df['AwayFTR'])
        a_idx = a_ct.index
        away_team_win_perc = []
        away_team_nowin_perc = []
        for a_team in a_idx:
            at_wins = a_ct[a_ct.index == a_team][1].values
            at_nowin = a_ct[a_ct.index == a_team][0].values
            at_tot = at_wins + at_nowin
            awin_perc = float(at_wins / at_tot)
            anowin_perc = float(at_nowin / at_tot)
            away_team_win_perc.append(awin_perc)
            away_team_nowin_perc.append(anowin_perc)

        fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(20, 16))
        fig.suptitle(f'{col} Win/NO Win Proportion Plots\n')
        fig.tight_layout(pad=13.0)
        # home plot
        hbar1 = ax1.bar(h_idx, home_team_nowin_perc, width=0.6)
        hbar2 = ax1.bar(h_idx, home_team_win_perc, width=0.6)
        ax1.set_title(f'{col} Home win/no win Proportions')
        ax1.set_xlabel(f'{col}')
        ax1.set_ylabel('Proportion')
        ax1.set_xticklabels(h_idx, rotation=90)
        ax1.legend((hbar1[0], hbar2[0]), ('Home No Win', 'Home Win'))
        ax1.grid(alpha=0.3)

        # away plot
        hbar1 = ax2.bar(a_idx, away_team_nowin_perc, width=0.6)
        hbar2 = ax2.bar(a_idx, away_team_win_perc, width=0.6)
        ax2.set_title(f'{col} Away win/no win Proportions')
        ax2.set_xlabel(f'{col}')
        ax2.set_ylabel('Proportion')
        ax2.set_xticklabels(a_idx, rotation=90)
        ax2.legend((hbar1[0], hbar2[0]), ('Away No Win', 'Away Win'))
        ax2.grid(alpha=0.3)
        plt.show()

In [None]:
# normality statistical test
train_features = load_train_features()
gaussian = []
non_gaussian = []

for col in train_features.columns:
    alpha = 0.05
    if col in cont_feat:
        print(f'\n{col}:')
        data = train_features[col].copy()
        shap_stat, shap_p = shapiro(data)
        dagos_stat, dagos_p = normaltest(data)
        stat_result = anderson(data)

        count = 0
        if shap_p > alpha:
            count += 1
            print(
                f'Shapiro test:  Sample looks gaussian, stat = {shap_stat}, p = {shap_p}')
        else:
            print(
                f'Shapiro test:  Sample does not look gaussian, stat = {shap_stat}, p = {shap_p}')

        if dagos_p > alpha:
            count += 1
            print(
                f"D'agostino test:  Sample looks gaussian, stat = {dagos_stat}, p = {dagos_p}")
        else:
            print(
                f"D'agostino test:  Sample does not look gaussian, stat = {dagos_stat}, p = {dagos_p}")

        # critical_values[2] is the critical value at the 5% significance level, matching alpha
        if stat_result.statistic < stat_result.critical_values[2]:
            count += 1
            print(f"Anderson test:  Sample looks gaussian")
        else:
            print(f"Anderson test:  Sample does not look gaussian")
        for i in range(len(stat_result.critical_values)):
            sl, cv = stat_result.significance_level[i], stat_result.critical_values[i]
            print(
                f"Anderson test:  stat = {stat_result.statistic}, sig = {sl}, crit val = {cv}")

        if count >= 2:
            gaussian.append(col)
        elif count <= 1:
            non_gaussian.append(col)

print(f'\nGaussian Features: {gaussian}')
print(f'\nNon Gaussian Features: {non_gaussian}')

In [None]:
# heatmap for collinearity
features = load_train_features()
#home_df = load_home_win_train_df()
#away_df = load_away_win_train_df()

heatmap_dict = {}
columns = []

for col1 in features:
    if col1 not in cyclical_feat:
        columns.append(col1)
        corr_vals = []
        f1 = features[col1].copy()
        for col2 in features:
            if col2 not in cyclical_feat:
                f2 = features[col2].copy()
                if col1 in cont_feat and col2 in cont_feat:  # continuous - continuous.. spearman
                    score = spearmanr(f1, f2)[0]
                    corr_vals.append(score)
                if col1 in cont_feat and col2 in ord_feat:  # continuous - ordinal.. spearman
                    score = spearmanr(f1, f2)[0]
                    corr_vals.append(score)
                if col1 in cont_feat and col2 in nominal_feat:  # continuous - nominal.. logistic regression fbeta
                    model = LogisticRegression(
                        class_weight='balanced', multi_class='multinomial')
                    scores = evaluate_model(
                        np.array(f1).reshape(-1, 1), f2, model)
                    corr_vals.append(np.mean(scores))
                if col1 in cont_feat and col2 in dich_feat:  # continuous - dichotomous.. logistic regression
                    model = LogisticRegression(class_weight='balanced')
                    scores = evaluate_model(
                        np.array(f1).reshape(-1, 1), f2, model, k_splits=3)
                    corr_vals.append(np.mean(scores))
                if col1 in nominal_feat and col2 in cont_feat:  # nominal - continuous.. mars R2
                    oh_f1 = pd.get_dummies(f1)
                    model = Earth()
                    scores = cross_val_score(
                        model, oh_f1, f2, scoring='r2', cv=5, n_jobs=-1)
                    corr_vals.append(np.mean(scores))
                if col1 in nominal_feat and col2 in ord_feat:  # nominal - ordinal.. cramers v
                    c_table = pd.crosstab(f1, f2).values
                    score = cramers_corrected_stat(c_table)
                    corr_vals.append(score)
                if col1 in nominal_feat and col2 in nominal_feat:  # nominal - nominal.. cramers v
                    c_table = pd.crosstab(f1, f2).values
                    score = cramers_corrected_stat(c_table)
                    corr_vals.append(score)
                if col1 in nominal_feat and col2 in dich_feat:  # nominal - dichotomous.. cramers v
                    c_table = pd.crosstab(f1, f2).values
                    score = cramers_corrected_stat(c_table)
                    corr_vals.append(score)
                if col1 in ord_feat and col2 in cont_feat:  # ordinal - continuous.. spearman
                    score = spearmanr(f1, f2)[0]
                    corr_vals.append(score)
                if col1 in ord_feat and col2 in ord_feat:  # ordinal - ordinal.. spearman
                    score = spearmanr(f1, f2)[0]
                    corr_vals.append(score)
                if col1 in ord_feat and col2 in nominal_feat:  # ordinal - nominal.. cramers v
                    c_table = pd.crosstab(f1, f2).values
                    score = cramers_corrected_stat(c_table)
                    corr_vals.append(score)
                if col1 in ord_feat and col2 in dich_feat:  # ordinal - dichotomous.. spearman
                    score = spearmanr(f1, f2)[0]
                    corr_vals.append(score)
                if col1 in dich_feat and col2 in cont_feat:  # dichotomous - continuous.. mars r2
                    model = Earth()
                    scores = cross_val_score(
                        model, np.array(f1).reshape(-1, 1), f2, scoring='r2', cv=5, n_jobs=-1)
                    corr_vals.append(np.mean(scores))
                if col1 in dich_feat and col2 in ord_feat:  # dichotomous - ordinal.. spearman
                    score = spearmanr(f1, f2)[0]
                    corr_vals.append(score)
                if col1 in dich_feat and col2 in nominal_feat:  # dichotomous - nominal.. cramers v
                    c_table = pd.crosstab(f1, f2).values
                    score = cramers_corrected_stat(c_table)
                    corr_vals.append(score)
                if col1 in dich_feat and col2 in dich_feat:  # dichotomous - dichotomous.. phi
                    c_table = pd.crosstab(f1, f2).values
                    score = phi(c_table)
                    corr_vals.append(score)

        heatmap_dict[col1] = corr_vals

corr_grid = pd.DataFrame(heatmap_dict, index=columns)

plt.figure(figsize=(30, 24))
plt.title('Feature Collinearity Plot')

sns.heatmap(corr_grid, vmin=-1, vmax=1, cmap='RdBu_r', linewidths=0.3)
plt.show()

In [None]:
# home feature / target .. correlation / association.. mutual info
home_df = load_home_win_train_df()
heatmap_dict = {}
columns = []
m_info = []
alpha = 0.05
h_significant = []
h_non_significant = []

for col in home_df:
    if col not in cyclical_feat and col != 'HomeFTR':
        columns.append(col)
        feat = home_df[col].copy()
        targ = home_df['HomeFTR'].copy()
        # continuous - dichotomous.. fbeta maximizing recall (minimise false negs)
        if col in cont_feat:
            model = LogisticRegression(class_weight='balanced')
            scores = evaluate_model(np.array(feat).reshape(-1, 1), targ, model)
            heatmap_dict[col] = np.mean(scores)
            # mutual info
            mi = mutual_info_classif(np.array(feat).reshape(-1, 1), targ)
            m_info.append(mi)
            # hypothesis test - convert to categorical
            feat_bin = pd.qcut(feat, 10, duplicates='drop')
            bin_ctable = pd.crosstab(feat_bin, targ)
            chi2_stats = chi2_contingency(bin_ctable, correction=True)
            print(f'\n{col}/HomeFTR hypothesis test:\n')
            if chi2_stats[1] > alpha:
                print(
                    f'Not statistically significant, likely to have occurred by chance\nChi2: {chi2_stats[0]}, Pvalue : {chi2_stats[1]}')
                h_non_significant.append(col)
            else:
                print(
                    f'Statistically significant, not likely to have occurred by chance\nChi2: {chi2_stats[0]}, Pvalue : {chi2_stats[1]}')
                h_significant.append(col)
        if col in nominal_feat:  # nominal - dichotomous.. cramers v
            c_table = pd.crosstab(feat, targ).values
            score = cramers_corrected_stat(c_table)
            heatmap_dict[col] = score
            # mutual info
            onehot_df = pd.get_dummies(feat)
            mi = mutual_info_classif(onehot_df, targ)
            m_info.append(np.sum(mi))
            # hypothesis test
            chi2_stats = chi2_contingency(c_table, correction=True)
            print(f'\n{col}/HomeFTR hypothesis test:\n')
            if chi2_stats[1] > alpha:
                print(
                    f'Not statistically significant, likely to have occurred by chance\nCramers V: {score}, Pvalue : {chi2_stats[1]}')
                h_non_significant.append(col)
            else:
                print(
                    f'Statistically significant, not likely to have occurred by chance\nCramers V: {score}, Pvalue : {chi2_stats[1]}')
                h_significant.append(col)
        if col in ord_feat:  # ordinal - dichotomous.. spearman rank
            spman_test = spearmanr(feat, targ)
            heatmap_dict[col] = spman_test[0]
            # mutual info
            mi = mutual_info_classif(np.array(feat).reshape(-1, 1), targ)
            m_info.append(mi)
            # hypothesis test
            print(f'\n{col}/HomeFTR hypothesis test:\n')
            if spman_test[1] > alpha:
                print(
                    f'Not statistically significant, likely to have occurred by chance\nSpearman stat: {spman_test[0]}, Pvalue : {spman_test[1]}')
                h_non_significant.append(col)
            else:
                print(
                    f'Statistically significant, not likely to have occurred by chance\nSpearman stat: {spman_test[0]}, Pvalue : {spman_test[1]}')
                h_significant.append(col)
        if col in dich_feat:  # dichotomous - dichotomous.. phi
            c_table = pd.crosstab(feat, targ).values
            score = phi(c_table)
            heatmap_dict[col] = score
            # mutual info
            mi = mutual_info_classif(np.array(feat).reshape(-1, 1), targ)
            m_info.append(mi)
            # hypothesis test
            print(f'\n{col}/HomeFTR hypothesis test:\n')
            chi2_stats = chi2_contingency(c_table, correction=True)
            if chi2_stats[1] > alpha:
                print(
                    f'Not statistically significant, likely to have occurred by chance\nPhi: {score}, Pvalue : {chi2_stats[1]}')
                h_non_significant.append(col)
            else:
                print(
                    f'Statistically significant, not likely to have occurred by chance\nPhi: {score}, Pvalue : {chi2_stats[1]}')
                h_significant.append(col)

print(
    f'\nSignificant: {h_significant}\n\nNon-Significant: {h_non_significant}')

for df in cyclical_dfs:
    if 'DayofWeek_sin' in df:
        col_name = 'DayofWeek'
        columns.append(col_name)
    if 'season_month_sin' in df:
        col_name = 'season_month'
        columns.append(col_name)

    corr_vals = []
    model = LogisticRegression(class_weight='balanced')
    scores = evaluate_model(df, targ, model)
    corr_vals.append(np.mean(scores))
    heatmap_dict[col_name] = corr_vals
    # mutual info
    mi = mutual_info_classif(df, targ)
    m_info.append(np.sum(mi))

corr_grid = pd.DataFrame(heatmap_dict)
fig, (ax1, ax2) = plt.subplots(2, figsize=(20, 16))
fig.tight_layout(pad=12.0)
sns.heatmap(corr_grid, vmin=-1, vmax=1,
            cmap='RdBu_r', yticklabels=False, ax=ax1)
ax1.set_title('Home Dataframe Feature/Target Correlation/Association')
ax1.set_ylabel('HomeFTR')

ax2.bar(columns, m_info)
ax2.set_title('Home Dataframe Feature/Target Mutual Information')
ax2.set_xlabel('Features')
ax2.set_ylabel('Mutual Information')
ax2.set_xticklabels(columns, rotation=90)
ax2.grid(alpha=0.3)

plt.show()

In [None]:
# away feature / target .. correlation / association.. mutual info
away_df = load_away_win_train_df()
heatmap_dict = {}
columns = []
m_info = []
alpha = 0.05
a_significant = []
a_non_significant = []

for col in away_df:
    if col not in cyclical_feat and col != 'AwayFTR':
        columns.append(col)
        feat = away_df[col].copy()
        targ = away_df['AwayFTR'].copy()
        if col in cont_feat:  # continuous - dichotomous.. roc auc score
            model = LogisticRegression(class_weight='balanced')
            scores = evaluate_model(np.array(feat).reshape(-1, 1), targ, model)
            heatmap_dict[col] = np.mean(scores)
            # mutual info
            mi = mutual_info_classif(np.array(feat).reshape(-1, 1), targ)
            m_info.append(mi)
            # hypothesis test - convert to categorical
            feat_bin = pd.qcut(feat, 10, duplicates='drop')
            bin_ctable = pd.crosstab(feat_bin, targ)
            chi2_stats = chi2_contingency(bin_ctable, correction=True)
            print(f'\n{col}/AwayFTR hypothesis test:\n')
            if chi2_stats[1] > alpha:
                print(
                    f'Not statistically significant, likely to have occurred by chance\nChi2: {chi2_stats[0]}, Pvalue : {chi2_stats[1]}')
                a_non_significant.append(col)
            else:
                print(
                    f'Statistically significant, not likely to have occurred by chance\nChi2: {chi2_stats[0]}, Pvalue : {chi2_stats[1]}')
                a_significant.append(col)
        if col in nominal_feat:  # nominal - dichotomous.. cramers v
            c_table = pd.crosstab(feat, targ).values
            score = cramers_corrected_stat(c_table)
            heatmap_dict[col] = score
            # mutual info
            onehot_df = pd.get_dummies(feat)
            mi = mutual_info_classif(onehot_df, targ)
            m_info.append(np.sum(mi))
            # hypothesis test
            chi2_stats = chi2_contingency(c_table, correction=True)
            print(f'\n{col}/AwayFTR hypothesis test:\n')
            if chi2_stats[1] > alpha:
                print(
                    f'Not statistically significant, likely to have occurred by chance\nCramers V: {score}, Pvalue : {chi2_stats[1]}')
                a_non_significant.append(col)
            else:
                print(
                    f'Statistically significant, not likely to have occurred by chance\nCramers V: {score}, Pvalue : {chi2_stats[1]}')
                a_significant.append(col)
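        if col in ord_feat:  # ordinal - dichotomous.. spearman rank
            # keep the full test result so the p-value below refers to this
            # feature rather than a stale value left over from the home cell
            spman_test = spearmanr(feat, targ)
            heatmap_dict[col] = spman_test[0]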
            # mutual info
            mi = mutual_info_classif(np.array(feat).reshape(-1, 1), targ)
            m_info.append(mi)
            # hypothesis test
            print(f'\n{col}/AwayFTR hypothesis test:\n')
            if spman_test[1] > alpha:
                print(
                    f'Not statistically significant, likely to have occurred by chance\nSpearman stat: {spman_test[0]}, Pvalue : {spman_test[1]}')
                a_non_significant.append(col)
            else:
                print(
                    f'Statistically significant, not likely to have occurred by chance\nSpearman stat: {spman_test[0]}, Pvalue : {spman_test[1]}')
                a_significant.append(col)
        if col in dich_feat:  # dichotomous - dichotomous.. phi
            c_table = pd.crosstab(feat, targ).values
            score = phi(c_table)
            heatmap_dict[col] = score
            # mutual info
            mi = mutual_info_classif(np.array(feat).reshape(-1, 1), targ)
            m_info.append(mi)
            # hypothesis test
            print(f'\n{col}/AwayFTR hypothesis test:\n')
            chi2_stats = chi2_contingency(c_table, correction=True)
            if chi2_stats[1] > alpha:
                print(
                    f'Not statistically significant, likely to have occurred by chance\nPhi: {score}, Pvalue : {chi2_stats[1]}')
                a_non_significant.append(col)
            else:
                print(
                    f'Statistically significant, not likely to have occurred by chance\nPhi: {score}, Pvalue : {chi2_stats[1]}')
                a_significant.append(col)

print(
    f'\nSignificant: {a_significant}\n\nNon-Significant: {a_non_significant}')

for df in cyclical_dfs:
    if 'DayofWeek_sin' in df:
        col_name = 'DayofWeek'
        columns.append(col_name)
    if 'season_month_sin' in df:
        col_name = 'season_month'
        columns.append(col_name)

    corr_vals = []
    model = LogisticRegression(class_weight='balanced')
    scores = evaluate_model(df, targ, model)
    corr_vals.append(np.mean(scores))
    heatmap_dict[col_name] = corr_vals
    # mutual info
    mi = mutual_info_classif(df, targ)
    m_info.append(np.sum(mi))

corr_grid = pd.DataFrame(heatmap_dict)
fig, (ax1, ax2) = plt.subplots(2, figsize=(20, 16))
fig.tight_layout(pad=12.0)
sns.heatmap(corr_grid, vmin=-1, vmax=1,
            cmap='RdBu_r', yticklabels=False, ax=ax1)
ax1.set_title('Away Dataframe Feature/Target Correlation/Association')
ax1.set_ylabel('AwayFTR')

ax2.bar(columns, m_info)
ax2.set_title('Away Dataframe Feature/Target Mutual Information')
ax2.set_xlabel('Features')
ax2.set_ylabel('Mutual Information')
ax2.set_xticklabels(columns, rotation=90)
ax2.grid(alpha=0.3)

plt.show()

In [None]:
# look for natural forming clusters... home data
home_df = load_home_win_train_df()
oh_home_df = football_data_team_ohe(home_df)

drop_cols = ['Div', 'FTHG', 'FTAG', 'HS', 'AS', 'HST', 'AST', 'HF', 'AF', 'HC',
             'AC', 'HY', 'AY', 'HR', 'AR', 'year', 'AHTGS5PG_UPoutlier',  'AATGS5PG_UPoutlier', 'AATGS5PG_LOWoutlier',
             'AHTGC5PG_LOWoutlier', 'AATGC5PG_UPoutlier', 'AHTGS5PHG_UPoutlier', 'AHTSOT5PG_LOWoutlier', 'AHTSOT5PHG_UPoutlier',
             'AATSOT5PAG_LOWoutlier', 'AHTGS_SOT5PG_ratio_UPoutlier', 'AATGS_SOT5PG_ratio_UPoutlier', 'AATGS_SOT5PAG_ratio_UPoutlier',
             'AHTGD5PG_LOWoutlier', 'AATGD5PG_UPoutlier', 'AATGD5PG_LOWoutlier', 'AHTGD5PHG_UPoutlier', 'AHTGD5PHG_LOWoutlier',
             'Local_Derby', 'Dist>=100', 'HomeFTR']
oh_home_df.drop(columns=drop_cols, inplace=True)

# elbow method for kmeans
distortions = []
inertias = []
n_k = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=2).fit(oh_home_df)

    # average Euclidean distance from each point to its nearest cluster centre
    distortions.append(sum(np.min(cdist(oh_home_df, kmeans.cluster_centers_,
                                        'euclidean'), axis=1)) / oh_home_df.shape[0])
    # sum of squared distances from the cluster centres
    inertias.append(kmeans.inertia_)
    n_k.append(k)

fig, (ax1, ax2) = plt.subplots(2, figsize=(16, 10))
fig.suptitle('Home Data Elbow Methods')
ax1.plot(n_k, distortions, 'bx-')
ax1.set_title('Elbow Method Using Distortion')
ax1.set_xlabel('Values of K')
ax1.set_ylabel('Distortion')
ax1.grid(alpha=0.3)

ax2.plot(n_k, inertias, 'bx-')
ax2.set_title('Elbow Method Using Inertia')
ax2.set_xlabel('Values of K')
ax2.set_ylabel('Inertia')
ax2.grid(alpha=0.3)

plt.show()

In [None]:
# look for natural forming clusters... away data
away_df = load_away_win_train_df()
oh_away_df = football_data_team_ohe(away_df)

drop_cols = ['Div', 'FTHG', 'FTAG', 'HS', 'AS', 'HST', 'AST', 'HF', 'AF', 'HC',
             'AC', 'HY', 'AY', 'HR', 'AR', 'year', 'AHTGS5PG_UPoutlier', 'AHTGC5PG_LOWoutlier', 'AATGC5PG_UPoutlier',
             'AHTGS5PHG_UPoutlier', 'AATGS5PAG_UPoutlier', 'AHTGC5PHG_UPoutlier', 'AATGC5PAG_UPoutlier',
             'AHTGS_SOT5PHG_ratio_UPoutlier', 'AHTGD5PG_UPoutlier', 'AATGD5PAG_UPoutlier', 'AATGD5PAG_LOWoutlier',
             'Local_Derby', 'Dist>=100', 'AwayFTR']
oh_away_df.drop(columns=drop_cols, inplace=True)

# elbow method for kmeans
distortions = []
inertias = []
n_k = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=2).fit(oh_away_df)

    # average Euclidean distance from each point to its nearest cluster centre
    distortions.append(sum(np.min(cdist(oh_away_df, kmeans.cluster_centers_,
                                        'euclidean'), axis=1)) / oh_away_df.shape[0])
    # sum of squared distances from the cluster centres
    inertias.append(kmeans.inertia_)
    n_k.append(k)

fig, (ax1, ax2) = plt.subplots(2, figsize=(16, 10))
fig.suptitle('Away Data Elbow Methods')
ax1.plot(n_k, distortions, 'bx-')
ax1.set_title('Elbow Method Using Distortion')
ax1.set_xlabel('Values of K')
ax1.set_ylabel('Distortion')
ax1.grid(alpha=0.3)

ax2.plot(n_k, inertias, 'bx-')
ax2.set_title('Elbow Method Using Inertia')
ax2.set_xlabel('Values of K')
ax2.set_ylabel('Inertia')
ax2.grid(alpha=0.3)

plt.show()

In [None]:
# look for natural forming clusters... all train data
train_features = load_train_features()
oh_train_feat = football_data_team_ohe(train_features)

drop_cols = ['Div', 'FTHG', 'FTAG', 'HS', 'AS', 'HST', 'AST', 'HF', 'AF', 'HC',
             'AC', 'HY', 'AY', 'HR', 'AR']
oh_train_feat.drop(columns=drop_cols, inplace=True)

# elbow method for kmeans
distortions = []
inertias = []
n_k = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=2).fit(oh_train_feat)

    # average Euclidean distance from each point to its nearest cluster centre
    distortions.append(sum(np.min(cdist(oh_train_feat, kmeans.cluster_centers_,
                                        'euclidean'), axis=1)) / oh_train_feat.shape[0])
    # sum of squared distances from the cluster centres
    inertias.append(kmeans.inertia_)
    n_k.append(k)

fig, (ax1, ax2) = plt.subplots(2, figsize=(16, 10))
fig.suptitle('Train Features Elbow Methods')
ax1.plot(n_k, distortions, 'bx-')
ax1.set_title('Elbow Method Using Distortion')
ax1.set_xlabel('Values of K')
ax1.set_ylabel('Distortion')
ax1.grid(alpha=0.3)

ax2.plot(n_k, inertias, 'bx-')
ax2.set_title('Elbow Method Using Inertia')
ax2.set_xlabel('Values of K')
ax2.set_ylabel('Inertia')
ax2.grid(alpha=0.3)

plt.show()

In [None]:
# pca for home features
pca_home_df = load_home_train_features_with_drop()  # load features
pca_home_df = football_data_team_ohe(pca_home_df)  # ohe football teams
sc = StandardScaler()
pca_home_df = sc.fit_transform(pca_home_df)
# fit pca algorithm
pca = PCA().fit(pca_home_df)
# Plotting the Cumulative Summation of the Explained Variance
plt.figure(figsize=(10, 6))
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of Components')
plt.ylabel('Variance (%)')  # for each component
plt.title('Explained Variance of Home Features')
plt.grid(alpha=0.3)
plt.show()

In [None]:
# pca for home features without home/away teams
pca_home_df = load_home_train_features_with_drop()  # load features
pca_home_df.drop(columns=['HomeTeam', 'AwayTeam'],
                 inplace=True)  # drop team columns instead of one-hot encoding them
sc = StandardScaler()
pca_home_df = sc.fit_transform(pca_home_df)
# fit pca algorithm
pca = PCA().fit(pca_home_df)
# Plotting the Cumulative Summation of the Explained Variance
plt.figure(figsize=(10, 6))
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of Components')
plt.ylabel('Variance (%)')  # for each component
plt.title('Explained Variance of Home Features Without H/A Teams')
plt.grid(alpha=0.3)
plt.show()

In [None]:
# pca for away features
pca_away_df = load_away_train_features_with_drop()  # load features
pca_away_df = football_data_team_ohe(pca_away_df)  # ohe football teams
sc = StandardScaler()
pca_away_df = sc.fit_transform(pca_away_df)
# fit pca algorithm
pca = PCA().fit(pca_away_df)
# Plotting the Cumulative Summation of the Explained Variance
plt.figure(figsize=(10, 6))
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of Components')
plt.ylabel('Variance (%)')  # for each component
plt.title('Explained Variance of Away Features')
plt.grid(alpha=0.3)
plt.show()

In [None]:
# pca for away features without home/away teams
pca_away_df = load_away_train_features_with_drop()  # load features
pca_away_df.drop(columns=['HomeTeam', 'AwayTeam'],
                 inplace=True)  # drop team columns instead of one-hot encoding them
sc = StandardScaler()
pca_away_df = sc.fit_transform(pca_away_df)
# fit pca algorithm
pca = PCA().fit(pca_away_df)
# Plotting the Cumulative Summation of the Explained Variance
plt.figure(figsize=(10, 6))
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of Components')
plt.ylabel('Variance (%)')  # for each component
plt.title('Explained Variance of Away Features Without H/A Teams')
plt.grid(alpha=0.3)
plt.show()

## - Summary -

The proportions of home wins and away wins both appear to be statistically significant, i.e. unlikely to have occurred by chance. The home win proportion is approximately 0.45 and the away win proportion approximately 0.29. The difference between these two proportions is also unlikely to have occurred by chance.

After engineering, the data contains ordinal, nominal, dichotomous, continuous and cyclical features.

The majority of the continuous data appears to be right-skewed, which makes sense because extreme values in football tend to sit at the high end, i.e. a high number of goals scored is rarer than a few goals scored. There also appear to be some Gaussian-like features.

Box-Cox, Yeo-Johnson and quantile transforms were applied to the continuous data to reduce skewness.
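
As a minimal sketch of how these three transforms compare, the example below assumes scikit-learn's `PowerTransformer` and `QuantileTransformer` and uses synthetic gamma-distributed data in place of the real features (the notebook's own transform cells are not part of this section, so the exact settings there may differ):

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer, QuantileTransformer

# synthetic right-skewed, strictly positive data (stand-in for a real feature)
rng = np.random.default_rng(0)
x = rng.gamma(shape=2.0, scale=1.0, size=(1000, 1))

transforms = {
    'box-cox': PowerTransformer(method='box-cox'),  # needs strictly positive input
    'yeo-johnson': PowerTransformer(method='yeo-johnson'),
    'quantile': QuantileTransformer(output_distribution='normal',
                                    n_quantiles=100, random_state=0),
}

# sample skewness before and after each transform
skewness = {'raw': float(skew(x.ravel()))}
for name, transformer in transforms.items():
    skewness[name] = float(skew(transformer.fit_transform(x).ravel()))

for name, value in skewness.items():
    print(f'{name}: skewness = {value:.3f}')
```

Box-Cox only accepts strictly positive values, so features containing zeros or negatives (such as goal differences) would need Yeo-Johnson or the quantile transform instead.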

Shapiro-Wilk, D'Agostino and Anderson-Darling normality tests were performed, with a majority vote determining whether a feature is normally distributed. All features appear to be non-Gaussian.
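
A majority vote over the three tests could look like the sketch below. The function name `is_gaussian_majority` and the 5% significance level are assumptions for illustration, not necessarily what was used in the notebook:

```python
import numpy as np
from scipy.stats import shapiro, normaltest, anderson


def is_gaussian_majority(x, alpha=0.05):
    """Return True if at least two of three tests fail to reject normality."""
    x = np.asarray(x, dtype=float)
    votes = []
    # Shapiro-Wilk: p-value above alpha means normality is not rejected
    votes.append(shapiro(x)[1] > alpha)
    # D'Agostino K^2: same p-value rule
    votes.append(normaltest(x)[1] > alpha)
    # Anderson-Darling: compare the statistic with the 5% critical value
    result = anderson(x, dist='norm')
    idx = int(np.argmin(np.abs(np.asarray(result.significance_level) - 5.0)))
    votes.append(result.statistic < result.critical_values[idx])
    return sum(bool(v) for v in votes) >= 2


# hypothetical usage on a clearly skewed sample
skewed = np.random.default_rng(0).exponential(size=2000)
print(is_gaussian_majority(skewed))
```

The Anderson-Darling test reports critical values rather than a p-value, which is why its vote is derived differently from the other two.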

There is high multicollinearity among the engineered indicator features, which is to be expected, and a mix of none to high for the other features. For numerical modelling algorithms PCA can be used to alleviate the effects of collinearity as well as reduce dimensionality. The predictions of tree-based algorithms should be largely unaffected by multicollinearity.

Chi-squared and Spearman's rho statistical significance tests were performed for feature/target correlation and association. Features whose relationship with the target is likely to have occurred by chance were removed, as they could harm generalisation to the population.

Spearman's rho, Cramér's V, phi and information gain (mutual information) were used to analyse feature/target correlation and association, with the least correlated/associated features removed. More features may be cut during model feature selection.
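
The helpers `cramers_corrected_stat` and `phi` used in the cells above are defined elsewhere in the notebook. As a hedged sketch of what such helpers typically compute, the hypothetical functions `cramers_v_corrected` and `phi_coefficient` below implement a bias-corrected Cramér's V and the signed phi coefficient from a contingency table:

```python
import numpy as np
from scipy.stats import chi2_contingency


def cramers_v_corrected(table):
    """Bias-corrected Cramer's V for an r x c contingency table."""
    table = np.asarray(table, dtype=float)
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.sum()
    r, c = table.shape
    phi2 = chi2 / n
    # subtract the expected value of phi^2 under independence
    phi2_corr = max(0.0, phi2 - (r - 1) * (c - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    c_corr = c - (c - 1) ** 2 / (n - 1)
    denom = min(r_corr - 1, c_corr - 1)
    return float(np.sqrt(phi2_corr / denom)) if denom > 0 else 0.0


def phi_coefficient(table):
    """Signed phi coefficient for a 2 x 2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = np.asarray(table, dtype=float)
    denom = np.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return float((a * d - b * c) / denom) if denom > 0 else 0.0


# hypothetical 2 x 2 table: rows = feature value, columns = target class
example = [[30, 10], [10, 30]]
print(cramers_v_corrected(example), phi_coefficient(example))
```

Cramér's V is unsigned and lies between 0 and 1, whereas phi lies between -1 and 1, which is worth keeping in mind when both share one heatmap colour scale.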

Looked for naturally forming clusters within the data and found 4 undetermined clusters; these clusters are used as features for prediction.

Reducing to 30 principal components using PCA still retains 98% of the variance. This will likely change if features are cut during model feature selection, but it will be useful for initial model testing.
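
Rather than reading the component count off the cumulative variance plot, it can be computed directly. The sketch below uses a hypothetical helper `n_components_for_variance` and synthetic data with three deliberately redundant columns, so the numbers are illustrative only:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def n_components_for_variance(X, threshold=0.98):
    """Smallest number of components whose cumulative explained variance
    ratio reaches the threshold, after standardising the features."""
    X_scaled = StandardScaler().fit_transform(X)
    cum_var = np.cumsum(PCA().fit(X_scaled).explained_variance_ratio_)
    return int(np.searchsorted(cum_var, threshold) + 1)


# synthetic data: 3 independent columns plus 3 near-duplicates of them
rng = np.random.default_rng(0)
base = rng.normal(size=(500, 3))
X = np.hstack([base, base + 0.01 * rng.normal(size=(500, 3))])

k = n_components_for_variance(X, threshold=0.98)
print(k)
```

Passing a float between 0 and 1 as `n_components` to scikit-learn's `PCA` selects the number of components by explained variance in a similar way, which may be more convenient inside a modelling pipeline.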

# -- Feature Engineering --

## Train Features

In [None]:
# separate month from date
month = []
for i in range(len(train_features)):
    date = train_features['Date'][i]
    month.append(int(date.split('/')[1]))

train_features['month'] = pd.Series(data=month)

In [None]:
# separate year from date
year = []
for i in range(len(train_features)):
    date = train_features['Date'][i]
    if len(date.split('/')[2]) <= 2:  # 2 digits for year
        year.append(datetime.datetime.strptime(
            train_features['Date'][i], '%d/%m/%y').strftime('%Y'))
    else:  # 4 digits for year
        year.append(datetime.datetime.strptime(
            train_features['Date'][i], '%d/%m/%Y').strftime('%Y'))

train_features['year'] = pd.Series(data=year)

In [None]:
# day of week... 0 = monday, 6 = sunday
date = pd.to_datetime(train_features['Date'], dayfirst=True)
train_features['DayofWeek'] = date.dt.dayofweek

In [None]:
# season month
# start august
train_features['season_month'] = 0
s_mnth_order = [8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7]
count = 0
for mnth in s_mnth_order:
    count += 1
    fm = train_features['month'] == mnth
    fm_idx = fm[fm].index.values
    train_features.loc[fm_idx, 'season_month'] = count

In [None]:
# extract cyclical nature from season month and day of week
# month cyclical
train_features['season_month_sin'] = np.sin(
    2*np.pi*train_features.season_month/12)
train_features['season_month_cos'] = np.cos(
    2*np.pi*train_features.season_month/12)
# day of week cyclical
train_features['DayofWeek_sin'] = np.sin(2*np.pi*train_features.DayofWeek/7)
train_features['DayofWeek_cos'] = np.cos(2*np.pi*train_features.DayofWeek/7)
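
The sine/cosine pair places each value on the unit circle, so the last and first values of a cycle end up as neighbours instead of being far apart as raw integers. A small self-contained illustration, using a hypothetical helper `cyclical_encode` rather than the notebook's columns:

```python
import numpy as np


def cyclical_encode(values, period):
    """Map values with the given period to (sin, cos) points on the unit circle."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.column_stack([np.sin(angle), np.cos(angle)])


season_months = np.arange(1, 13)  # 1 = August ... 12 = July
enc = cyclical_encode(season_months, 12)

# distance between season month 12 and 1 (wrap-around) versus 1 and 2
wrap_gap = np.linalg.norm(enc[11] - enc[0])
step_gap = np.linalg.norm(enc[1] - enc[0])
print(wrap_gap, step_gap)
```

Both gaps correspond to the same step around the circle, whereas the raw integers 12 and 1 differ by 11.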

In [None]:
# average goals scored to shots on target ratio (goals scored / shots on target)
# 5 previous games for home team and away team
train_features['AHTGS_SOT5PG_ratio'] = np.round(
    train_features.AHTGS5PG / train_features.AHTSOT5PG, 2)
train_features['AATGS_SOT5PG_ratio'] = np.round(
    train_features.AATGS5PG / train_features.AATSOT5PG, 2)

# 5 previous home games for home team
train_features['AHTGS_SOT5PHG_ratio'] = np.round(
    train_features.AHTGS5PHG / train_features.AHTSOT5PHG, 2)

# 5 previous away games for away team
train_features['AATGS_SOT5PAG_ratio'] = np.round(
    train_features.AATGS5PAG / train_features.AATSOT5PAG, 2)

In [None]:
# average goal difference (goals scored - goals conceded)
# 5 previous games for home team and away team
train_features['AHTGD5PG'] = train_features.AHTGS5PG - train_features.AHTGC5PG
train_features['AATGD5PG'] = train_features.AATGS5PG - train_features.AATGC5PG

# 5 previous home games for home team
train_features['AHTGD5PHG'] = train_features.AHTGS5PHG - \
    train_features.AHTGC5PHG

# 5 previous away games for away team
train_features['AATGD5PAG'] = train_features.AATGS5PAG - \
    train_features.AATGC5PAG

In [None]:
# remove date column
train_features.drop(columns=['Date'], inplace=True)

In [None]:
# convert division into ordinal numbers
# premier league
prem = train_features['Div'] == 'E0'
prem_idx = prem[prem].index.values
train_features.loc[prem_idx, 'Div'] = 2
# championship
champ = train_features['Div'] == 'E1'
champ_idx = champ[champ].index.values
train_features.loc[champ_idx, 'Div'] = 1
# league 1
leag1 = train_features['Div'] == 'E2'
leag1_idx = leag1[leag1].index.values
train_features.loc[leag1_idx, 'Div'] = 0

In [None]:
train_features['Div'] = train_features['Div'].astype('int64')

In [None]:
# home/away all avg sot... boxcox transformation

In [None]:
# values above/below continuous outlier bounds
features = load_train_features()
for key in outlier_dict:
    # if df has values above upper outlier bound
    if np.sum(features[key] > outlier_dict[key][0]):
        features[f'{key}_UPoutlier'] = 0
        mask = features[key] > outlier_dict[key][0]
        upper_idx = mask[mask].index.values
        features.loc[upper_idx, f'{key}_UPoutlier'] = 1
    # if df has values below lower outlier bound
    if np.sum(features[key] < outlier_dict[key][1]):
        features[f'{key}_LOWoutlier'] = 0
        mask = features[key] < outlier_dict[key][1]
        lower_idx = mask[mask].index.values
        features.loc[lower_idx, f'{key}_LOWoutlier'] = 1

# drop outliers for features we cant use
non_feats = ['FTHG_UPoutlier', 'FTAG_UPoutlier', 'HS_UPoutlier', 'AS_UPoutlier', 'HST_UPoutlier', 'AST_UPoutlier',
             'HF_UPoutlier', 'HF_LOWoutlier', 'AF_UPoutlier', 'HC_UPoutlier', 'AC_UPoutlier', 'HY_UPoutlier', 'AY_UPoutlier',
             'HR_UPoutlier', 'AR_UPoutlier']
features.drop(columns=non_feats, inplace=True)


# save train_features
save_train_features(features)

In [None]:
# home/away team AHTGS5PG difference
features = load_train_features()
features['HA_AHTGS5PG_diff'] = features['AHTGS5PG'] - features['AATGS5PG']

# save train_features
save_train_features(features)

In [None]:
# home/away team AHTP5PG difference
features = load_train_features()
features['HA_ATP5PG_diff'] = features['AHTP5PG'] - features['AATP5PG']

# save train_features
save_train_features(features)

In [None]:
# average goals scored to average points ratio
features = load_train_features()
# drop any earlier version of these ratios so the cell can be re-run safely
features.drop(columns=['AHT_GS_P5PG_ratio', 'AAT_GS_P5PG_ratio'],
              inplace=True, errors='ignore')
# home team
features['AHT_GS_P5PG_ratio'] = (
    features['AHTGS5PG'] + 1) / (features['AHTP5PG'] + 1)
# away team
features['AAT_GS_P5PG_ratio'] = (
    features['AATGS5PG'] + 1) / (features['AATP5PG'] + 1)

# save train_features
save_train_features(features)

In [None]:
# home - away football ground distance
features = load_train_features()
features['AwayTeamDist'] = np.nan
geolocator = Nominatim(user_agent="football_ground_distance")

team_geo_loc_dict = {}
for team in sorted(features['HomeTeam'].value_counts().index):
    t = get_football_ground(team)
    t_loc = geolocator.geocode(t[0])
    t_geo_loc = (t_loc.latitude, t_loc.longitude)
    team_geo_loc_dict[team] = t_geo_loc


for h_team in sorted(features['HomeTeam'].value_counts().index):
    away_teams = list(
        set(features[(features['HomeTeam'] == h_team)]['AwayTeam'].values))
    for a_team in away_teams:
        h_team_loc = team_geo_loc_dict.get(h_team)
        a_team_loc = team_geo_loc_dict.get(a_team)
        a_team_dist = np.round(geodesic(h_team_loc, a_team_loc).miles, 2)
        mask = (features['HomeTeam'] == h_team) & (
            features['AwayTeam'] == a_team)
        g_idx = mask[mask].index.values
        features.loc[g_idx, 'AwayTeamDist'] = a_team_dist

save_train_features(features)

In [None]:
# home/ away stadium capacity difference
features = load_train_features()

features['AwayCapacityDiff'] = np.nan

for h_team in sorted(features['HomeTeam'].value_counts().index):
    away_teams = list(
        set(features[(features['HomeTeam'] == h_team)]['AwayTeam'].values))
    for a_team in away_teams:
        htg = get_football_ground(h_team)
        atg = get_football_ground(a_team)
        cap_diff = atg[1] - htg[1]
        mask = (features['HomeTeam'] == h_team) & (
            features['AwayTeam'] == a_team)
        g_idx = mask[mask].index.values
        features.loc[g_idx, 'AwayCapacityDiff'] = cap_diff

save_train_features(features)

In [None]:
# bin away ground distance
features = load_train_features()
features['AwayTeamDist_bin'] = np.floor_divide(features['AwayTeamDist'], 10)

save_train_features(features)

In [None]:
# bin capacity difference
features = load_train_features()
features['AwayCapacityDiff_bin'] = np.floor_divide(
    features['AwayCapacityDiff'], 1000)

save_train_features(features)

In [None]:
# local derby - distance 10 or below
features = load_train_features()
features['Local_Derby'] = 0

mask = features['AwayTeamDist'] <= 10.0
ld_idx = mask[mask].index.values
features.loc[ld_idx, 'Local_Derby'] = 1

save_train_features(features)

In [None]:
# distance > = 100
features = load_train_features()
features['Dist>=100'] = 0

mask = features['AwayTeamDist'] >= 100
d_idx = mask[mask].index.values
features.loc[d_idx, 'Dist>=100'] = 1

save_train_features(features)

In [None]:
# run k means on train features with 4 clusters
# observe clusters
# engineer features for clusters
# check association

train_features = load_train_features()
oh_train_feat = football_data_team_ohe(train_features)

drop_cols = ['Div', 'FTHG', 'FTAG', 'HS', 'AS', 'HST', 'AST', 'HF', 'AF', 'HC',
             'AC', 'HY', 'AY', 'HR', 'AR']
oh_train_feat.drop(columns=drop_cols, inplace=True)
kmeans = KMeans(n_clusters=4, random_state=2).fit(oh_train_feat)
train_features['cluster'] = kmeans.labels_

for i in range(0, 4):
    train_features[f'cluster_{i}'] = 0
    mask = train_features['cluster'] == i
    cl_idx = mask[mask].index.values
    train_features.loc[cl_idx, f'cluster_{i}'] = 1

train_features.drop(columns=['cluster'], inplace=True)

save_train_features(train_features)

In [None]:
# previous end of season position

# load dictionary
with open('season_dictionary.pkl', 'rb') as dict_file:
    season_dict = pickle.load(dict_file)
# load train features
train_features = load_train_features()
# create features for new data
train_features['HT_PrevSeasonPos'] = np.nan
train_features['AT_PrevSeasonPos'] = np.nan

for i in range(0, 21):
    if i == 0:
        # get dict season.. change for dif vals of i!!
        season = season_dict.get(f'99/0{i}')
        first_half = train_features[(train_features['year'] == int(f'200{i}')) & (
            train_features['month'] >= 8)]  # first half season
        second_half = train_features[(train_features['year'] == int(
            f'200{i+1}')) & (train_features['month'] < 8)]  # second half season
    elif 0 < i < 10:
        season = season_dict.get(f'0{i-1}/0{i}')
        first_half = train_features[(train_features['year'] == int(f'200{i}')) & (
            train_features['month'] >= 8)]  # first half season
        if i == 9:
            second_half = train_features[(train_features['year'] == int(
                f'20{i+1}')) & (train_features['month'] < 8)]  # second half season
        else:
            second_half = train_features[(train_features['year'] == int(
                f'200{i+1}')) & (train_features['month'] < 8)]  # second half season
    elif i == 10:
        season = season_dict.get(f'0{i-1}/{i}')
        first_half = train_features[(train_features['year'] == int(f'20{i}')) & (
            train_features['month'] >= 8)]  # first half season
        second_half = train_features[(train_features['year'] == int(
            f'20{i+1}')) & (train_features['month'] < 8)]  # second half season
    else:
        season = season_dict.get(f'{i-1}/{i}')
        first_half = train_features[(train_features['year'] == int(f'20{i}')) & (
            train_features['month'] >= 8)]  # first half season
        second_half = train_features[(train_features['year'] == int(
            f'20{i+1}')) & (train_features['month'] < 8)]  # second half season

    table = season[0][1:]  # prem league but appending all standings into this
    champ = season[1][1:]
    le1 = season[2][1:]  # join each season to make full standings df
    le2 = season[3][1:]
    for l in [champ, le1, le2]:
        for standing in l:
            table.append(standing)
    # DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
    full_season = pd.concat(
        [first_half, second_half], ignore_index=True)  # first and second joined
    h_teams = sorted(
        full_season['HomeTeam'].value_counts().index)  # home teams
    a_teams = sorted(
        full_season['AwayTeam'].value_counts().index)  # away teams
    for h_team in h_teams:
        # get masks for team in dictionary from data
        h_first_mask = first_half['HomeTeam'] == h_team
        h_second_mask = second_half['HomeTeam'] == h_team
        # index of home team in data
        h_first_idx = list(h_first_mask[h_first_mask].index.values)
        h_second_idx = list(h_second_mask[h_second_mask].index.values)
        h_mask_idx = sorted(h_first_idx + h_second_idx)
        # get position from team in dictionary into new column
        for position in table:
            if position[1] == 'Wimbledon' and h_team == 'Milton Keynes Dons' and i == 4:
                train_features.loc[h_mask_idx,
                                   'HT_PrevSeasonPos'] = position[0]
            if position[1] == 'Wimbledon' and h_team == 'AFC Wimbledon':
                train_features.loc[h_mask_idx,
                                   'HT_PrevSeasonPos'] = position[0]
            if position[1] == 'Cheltenham Town' and h_team == 'Cheltenham':
                train_features.loc[h_mask_idx,
                                   'HT_PrevSeasonPos'] = position[0]
            if position[1] == h_team:
                # insert prev position if table team = h_team
                train_features.loc[h_mask_idx,
                                   'HT_PrevSeasonPos'] = position[0]
    for a_team in a_teams:
        a_first_mask = first_half['AwayTeam'] == a_team
        a_second_mask = second_half['AwayTeam'] == a_team
        # index of away team in data
        a_first_idx = list(a_first_mask[a_first_mask].index.values)
        a_second_idx = list(a_second_mask[a_second_mask].index.values)
        a_mask_idx = sorted(a_first_idx + a_second_idx)
        for position in table:
            # season before mk dons were formed
            if position[1] == 'Wimbledon' and a_team == 'Milton Keynes Dons' and i == 4:
                train_features.loc[a_mask_idx,
                                   'AT_PrevSeasonPos'] = position[0]
            # wimbledon in dictionary afc wimbledon in data
            if position[1] == 'Wimbledon' and a_team == 'AFC Wimbledon':
                train_features.loc[a_mask_idx,
                                   'AT_PrevSeasonPos'] = position[0]
            # cheltenham town in dict and cheltenham in data
            if position[1] == 'Cheltenham Town' and a_team == 'Cheltenham':
                train_features.loc[a_mask_idx,
                                   'AT_PrevSeasonPos'] = position[0]
            if position[1] == a_team:
                train_features.loc[a_mask_idx,
                                   'AT_PrevSeasonPos'] = position[0]

# save_train_features(train_features)
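
The four branches in the loop above differ only in how the two-digit season key and the calendar years are zero-padded. As a minimal sketch, assuming the keys of `season_dict` always follow the `'YY/YY'` pattern used in that cell, the same values can be produced with format specifiers. `season_key` and `season_years` are hypothetical helper names, not part of the notebook.

```python
def season_key(i):
    """Key of the season that ended in year 2000 + i, e.g. 0 -> '99/00'."""
    # (i - 1) % 100 wraps -1 round to 99 so that i == 0 gives '99/00'
    return f'{(i - 1) % 100:02d}/{i:02d}'


def season_years(i):
    """Calendar years spanned by the season that starts in August of 2000 + i."""
    return 2000 + i, 2000 + i + 1


for i in (0, 1, 9, 10, 20):
    print(i, season_key(i), season_years(i))
```

With helpers like these, the `if`/`elif` chain could shrink to a single lookup plus two year filters, which also removes the need to special-case `i == 9`.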

In [None]:
# invert prev season position for ordinality
# max(num_list) + 1 - x for x in num_list
train_features = load_train_features()

# change dtype from float to int
train_features['HT_PrevSeasonPos'] = train_features['HT_PrevSeasonPos'].astype(
    'int64')
train_features['AT_PrevSeasonPos'] = train_features['AT_PrevSeasonPos'].astype(
    'int64')
# insert inv feat
train_features['HT_PrevSeasonPos_inv'] = 0
train_features['AT_PrevSeasonPos_inv'] = 0

# home team prev season
for i in range(1, max(train_features['HT_PrevSeasonPos']) + 1):  # max = 75
    mask = train_features['HT_PrevSeasonPos'] == i
    mask_idx = mask[mask].index.values
    invert_val = max(train_features['HT_PrevSeasonPos']) + 1 - i
    train_features.loc[mask_idx, 'HT_PrevSeasonPos_inv'] = invert_val

for i in range(1, max(train_features['AT_PrevSeasonPos']) + 1):  # max = 75
    mask = train_features['AT_PrevSeasonPos'] == i
    mask_idx = mask[mask].index.values
    invert_val = max(train_features['AT_PrevSeasonPos']) + 1 - i
    train_features.loc[mask_idx, 'AT_PrevSeasonPos_inv'] = invert_val

# drop original columns
train_features.drop(columns=['HT_PrevSeasonPos',
                             'AT_PrevSeasonPos'], inplace=True)
# save features
save_train_features(train_features)
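
The inversion in the cell above is plain arithmetic, `max + 1 - x`, so it does not need a loop over every possible position. The sketch below shows the idea on an ordinary list with made-up positions; `invert_positions` is a hypothetical name. On a pandas Series the same expression, `s.max() + 1 - s`, is applied elementwise.

```python
def invert_positions(positions):
    """Flip league positions so that a larger value means a better finish."""
    top = max(positions)
    # position 1 maps to the maximum, the last position maps to 1
    return [top + 1 - p for p in positions]


positions = [1, 2, 20, 75]
inverted = invert_positions(positions)
print(inverted)  # [75, 74, 56, 1]
```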

In [None]:
# log of some features
# exp width bins for away capacity diff
# upper/lower quartile indicators for some features
# any transaction features

In [None]:
# box-cox power transform
train_features = load_train_features()

# goals scored 5pg
train_features['bxcx_AATGS5PG'] = boxcox(
    train_features['AATGS5PG'] + 1, lmbda=0.1677360888986696)
train_features['bxcx_AHTGS5PG'] = boxcox(
    train_features['AHTGS5PG'] + 1, lmbda=0.15663078083826068)
# goals conceded 5pg
train_features['bxcx_AHTGC5PG'] = boxcox(
    train_features['AHTGC5PG'] + 1, lmbda=0.37450136993128647)
train_features['bxcx_AATGC5PG'] = boxcox(
    train_features['AATGC5PG'] + 1, lmbda=0.3435345631007946)
# shots on target 5pg
train_features['bxcx_AHTSOT5PG'] = boxcox(
    train_features['AHTSOT5PG'] + 1, lmbda=0.13473011722266465)
train_features['bxcx_AATSOT5PG'] = boxcox(
    train_features['AATSOT5PG'] + 1, lmbda=0.15580637160487532)
# goals scored/point 5pg ratio
train_features['bxcx_AHT_GS_P5PG_ratio'] = boxcox(
    train_features['AHT_GS_P5PG_ratio'] + 1, lmbda=-0.4193818135094843)
train_features['bxcx_AAT_GS_P5PG_ratio'] = boxcox(
    train_features['AAT_GS_P5PG_ratio'] + 1, lmbda=-0.42013380790196936)

save_train_features(train_features)
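
For reference, and assuming `boxcox` here is `scipy.stats.boxcox`, calling it with a fixed `lmbda` applies the standard one-parameter Box-Cox transform to the shifted feature. The shift by 1 keeps the input strictly positive when a five-game average is zero:

$$
y =
\begin{cases}
\dfrac{(x + 1)^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\[6pt]
\ln(x + 1), & \lambda = 0
\end{cases}
$$

The fixed $\lambda$ values in the cell above were presumably estimated in an earlier step that is not shown in this section.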

In [None]:
# upper/ lower quartiles

train_features = load_train_features()

# difference between home/away goals scored 5pg.. upper/lower quartiles
train_features['HA_AHTGS5PG_diff_upqrt'] = 0
train_features['HA_AHTGS5PG_diff_lowqrt'] = 0

up_qrt = train_features['HA_AHTGS5PG_diff'].quantile([.25, .5, .75])[0.75]
low_qrt = train_features['HA_AHTGS5PG_diff'].quantile([.25, .5, .75])[0.25]

up_mask = train_features['HA_AHTGS5PG_diff'] >= up_qrt
low_mask = train_features['HA_AHTGS5PG_diff'] <= low_qrt

up_idx = up_mask[up_mask].index.values
low_idx = low_mask[low_mask].index.values

train_features.loc[up_idx, 'HA_AHTGS5PG_diff_upqrt'] = 1
train_features.loc[low_idx, 'HA_AHTGS5PG_diff_lowqrt'] = 1

# average goals conceded 5pg.. upper/lower quartiles

train_features['AHTGC5PG_upqrt'] = 0
train_features['AHTGC5PG_lowqrt'] = 0
train_features['AATGC5PG_upqrt'] = 0
train_features['AATGC5PG_lowqrt'] = 0

h_up_qrt = train_features['AHTGC5PG'].quantile([.25, .5, .75])[0.75]
h_low_qrt = train_features['AHTGC5PG'].quantile([.25, .5, .75])[0.25]
a_up_qrt = train_features['AATGC5PG'].quantile([.25, .5, .75])[0.75]
a_low_qrt = train_features['AATGC5PG'].quantile([.25, .5, .75])[0.25]

h_up_mask = train_features['AHTGC5PG'] >= h_up_qrt
h_low_mask = train_features['AHTGC5PG'] <= h_low_qrt
a_up_mask = train_features['AATGC5PG'] >= a_up_qrt
a_low_mask = train_features['AATGC5PG'] <= a_low_qrt

h_up_idx = h_up_mask[h_up_mask].index.values
h_low_idx = h_low_mask[h_low_mask].index.values
a_up_idx = a_up_mask[a_up_mask].index.values
a_low_idx = a_low_mask[a_low_mask].index.values

train_features.loc[h_up_idx, 'AHTGC5PG_upqrt'] = 1
train_features.loc[h_low_idx, 'AHTGC5PG_lowqrt'] = 1
train_features.loc[a_up_idx, 'AATGC5PG_upqrt'] = 1
train_features.loc[a_low_idx, 'AATGC5PG_lowqrt'] = 1

save_train_features(train_features)

In [None]:
# interaction feature.. ahtgs5pg upper quartile, ahtgc5pg lower quartile
train_features = load_train_features()

train_features['HT_GSGC_UPLOW_QRT'] = 0

up_qrt = train_features['AHTGS5PG'].quantile([.75])[0.75]
low_qrt = train_features['AHTGC5PG'].quantile([.25])[0.25]

up_qrt_mask = train_features['AHTGS5PG'] >= up_qrt
up_qrt_mask_idx = up_qrt_mask[up_qrt_mask].index.values
low_qrt_mask = train_features['AHTGC5PG'] <= low_qrt
low_qrt_mask_idx = low_qrt_mask[low_qrt_mask].index.values

up_qrt_mask_set = set(up_qrt_mask_idx)
common_indices = list(sorted(up_qrt_mask_set.intersection(low_qrt_mask_idx)))

train_features.loc[common_indices, 'HT_GSGC_UPLOW_QRT'] = 1

save_train_features(train_features)

In [None]:
# interaction feature.. aatgs5pg upper quartile, aatgc5pg lower quartile
train_features = load_train_features()

train_features['AT_GSGC_UPLOW_QRT'] = 0

up_qrt = train_features['AATGS5PG'].quantile([.75])[0.75]
low_qrt = train_features['AATGC5PG'].quantile([.25])[0.25]

up_qrt_mask = train_features['AATGS5PG'] >= up_qrt
up_qrt_mask_idx = up_qrt_mask[up_qrt_mask].index.values
low_qrt_mask = train_features['AATGC5PG'] <= low_qrt
low_qrt_mask_idx = low_qrt_mask[low_qrt_mask].index.values

up_qrt_mask_set = set(up_qrt_mask_idx)
common_indices = list(sorted(up_qrt_mask_set.intersection(low_qrt_mask_idx)))

train_features.loc[common_indices, 'AT_GSGC_UPLOW_QRT'] = 1

save_train_features(train_features)

In [None]:
# interaction feature.. ahtgs5pg upper qrt, aatgc5pg lower qrt
train_features = load_train_features()

train_features['AHTGS5PG_upqrt_AATGC5PG_lowqrt'] = 0

up_qrt = train_features['AHTGS5PG'].quantile([.75])[0.75]
low_qrt = train_features['AATGC5PG'].quantile([.25])[0.25]

up_qrt_mask = train_features['AHTGS5PG'] >= up_qrt
up_qrt_mask_idx = up_qrt_mask[up_qrt_mask].index.values
low_qrt_mask = train_features['AATGC5PG'] <= low_qrt
low_qrt_mask_idx = low_qrt_mask[low_qrt_mask].index.values

up_qrt_mask_set = set(up_qrt_mask_idx)
common_indices = list(sorted(up_qrt_mask_set.intersection(low_qrt_mask_idx)))

train_features.loc[common_indices, 'AHTGS5PG_upqrt_AATGC5PG_lowqrt'] = 1

save_train_features(train_features)

In [None]:
# interaction feature.. aatgs5pg upper qrt, ahtgc5pg lower qrt
train_features = load_train_features()

train_features['AATGS5PG_upqrt_AHTGC5PG_lowqrt'] = 0

up_qrt = train_features['AATGS5PG'].quantile([.75])[0.75]
low_qrt = train_features['AHTGC5PG'].quantile([.25])[0.25]

up_qrt_mask = train_features['AATGS5PG'] >= up_qrt
up_qrt_mask_idx = up_qrt_mask[up_qrt_mask].index.values
low_qrt_mask = train_features['AHTGC5PG'] <= low_qrt
low_qrt_mask_idx = low_qrt_mask[low_qrt_mask].index.values

up_qrt_mask_set = set(up_qrt_mask_idx)
common_indices = list(sorted(up_qrt_mask_set.intersection(low_qrt_mask_idx)))

train_features.loc[common_indices, 'AATGS5PG_upqrt_AHTGC5PG_lowqrt'] = 1

save_train_features(train_features)

In [None]:
# interaction feature.. ahtgs5pg upper qrt, aatgs5pg lower qrt
train_features = load_train_features()

train_features['AHTGS5PG_upqrt_AATGS5PG_lowqrt'] = 0

up_qrt = train_features['AHTGS5PG'].quantile([.75])[0.75]
low_qrt = train_features['AATGS5PG'].quantile([.25])[0.25]

up_qrt_mask = train_features['AHTGS5PG'] >= up_qrt
up_qrt_mask_idx = up_qrt_mask[up_qrt_mask].index.values
low_qrt_mask = train_features['AATGS5PG'] <= low_qrt
low_qrt_mask_idx = low_qrt_mask[low_qrt_mask].index.values

up_qrt_mask_set = set(up_qrt_mask_idx)
common_indices = list(sorted(up_qrt_mask_set.intersection(low_qrt_mask_idx)))

train_features.loc[common_indices, 'AHTGS5PG_upqrt_AATGS5PG_lowqrt'] = 1

save_train_features(train_features)

In [None]:
# interaction feature.. aatgs5pg upper qrt, ahtgs5pg lower qrt
train_features = load_train_features()

train_features['AATGS5PG_upqrt_AHTGS5PG_lowqrt'] = 0

up_qrt = train_features['AATGS5PG'].quantile([.75])[0.75]
low_qrt = train_features['AHTGS5PG'].quantile([.25])[0.25]

up_qrt_mask = train_features['AATGS5PG'] >= up_qrt
up_qrt_mask_idx = up_qrt_mask[up_qrt_mask].index.values
low_qrt_mask = train_features['AHTGS5PG'] <= low_qrt
low_qrt_mask_idx = low_qrt_mask[low_qrt_mask].index.values

up_qrt_mask_set = set(up_qrt_mask_idx)
common_indices = list(sorted(up_qrt_mask_set.intersection(low_qrt_mask_idx)))

train_features.loc[common_indices, 'AATGS5PG_upqrt_AHTGS5PG_lowqrt'] = 1

save_train_features(train_features)

In [None]:
# interaction feature.. ahtgs5phg upper qrt, aatgs5pag lower qrt
train_features = load_train_features()

train_features['AHTGS5PHG_upqrt_AATGS5PAG_lowqrt'] = 0

up_qrt = train_features['AHTGS5PHG'].quantile([.75])[0.75]
low_qrt = train_features['AATGS5PAG'].quantile([.25])[0.25]

up_qrt_mask = train_features['AHTGS5PHG'] >= up_qrt
up_qrt_mask_idx = up_qrt_mask[up_qrt_mask].index.values
low_qrt_mask = train_features['AATGS5PAG'] <= low_qrt
low_qrt_mask_idx = low_qrt_mask[low_qrt_mask].index.values

up_qrt_mask_set = set(up_qrt_mask_idx)
common_indices = list(sorted(up_qrt_mask_set.intersection(low_qrt_mask_idx)))

train_features.loc[common_indices, 'AHTGS5PHG_upqrt_AATGS5PAG_lowqrt'] = 1

save_train_features(train_features)

In [None]:
# interaction feature.. aatgs5pag upper qrt, ahtgs5phg lower qrt
train_features = load_train_features()

train_features['AATGS5PAG_upqrt_AHTGS5PHG_lowqrt'] = 0

up_qrt = train_features['AATGS5PAG'].quantile([.75])[0.75]
low_qrt = train_features['AHTGS5PHG'].quantile([.25])[0.25]

up_qrt_mask = train_features['AATGS5PAG'] >= up_qrt
up_qrt_mask_idx = up_qrt_mask[up_qrt_mask].index.values
low_qrt_mask = train_features['AHTGS5PHG'] <= low_qrt
low_qrt_mask_idx = low_qrt_mask[low_qrt_mask].index.values

up_qrt_mask_set = set(up_qrt_mask_idx)
common_indices = list(sorted(up_qrt_mask_set.intersection(low_qrt_mask_idx)))

train_features.loc[common_indices, 'AATGS5PAG_upqrt_AHTGS5PHG_lowqrt'] = 1

save_train_features(train_features)

In [None]:
# interaction feature.. ahtsot5pg upper qrt, aatsot5pg lower qrt
train_features = load_train_features()

train_features['AHTSOT5PG_upqrt_AATSOT5PG_lowqrt'] = 0

up_qrt = train_features['AHTSOT5PG'].quantile([.75])[0.75]
low_qrt = train_features['AATSOT5PG'].quantile([.25])[0.25]

up_qrt_mask = train_features['AHTSOT5PG'] >= up_qrt
up_qrt_mask_idx = up_qrt_mask[up_qrt_mask].index.values
low_qrt_mask = train_features['AATSOT5PG'] <= low_qrt
low_qrt_mask_idx = low_qrt_mask[low_qrt_mask].index.values

up_qrt_mask_set = set(up_qrt_mask_idx)
common_indices = list(sorted(up_qrt_mask_set.intersection(low_qrt_mask_idx)))

train_features.loc[common_indices, 'AHTSOT5PG_upqrt_AATSOT5PG_lowqrt'] = 1

save_train_features(train_features)

In [None]:
# interaction feature.. aatsot5pg upper qrt, ahtsot5pg lower qrt
train_features = load_train_features()

train_features['AATSOT5PG_upqrt_AHTSOT5PG_lowqrt'] = 0

up_qrt = train_features['AATSOT5PG'].quantile([.75])[0.75]
low_qrt = train_features['AHTSOT5PG'].quantile([.25])[0.25]

up_qrt_mask = train_features['AATSOT5PG'] >= up_qrt
up_qrt_mask_idx = up_qrt_mask[up_qrt_mask].index.values
low_qrt_mask = train_features['AHTSOT5PG'] <= low_qrt
low_qrt_mask_idx = low_qrt_mask[low_qrt_mask].index.values

up_qrt_mask_set = set(up_qrt_mask_idx)
common_indices = list(sorted(up_qrt_mask_set.intersection(low_qrt_mask_idx)))

train_features.loc[common_indices, 'AATSOT5PG_upqrt_AHTSOT5PG_lowqrt'] = 1

save_train_features(train_features)

In [None]:
# interaction feature.. ahtsot5phg upper qrt, aatsot5pag lower qrt
train_features = load_train_features()

train_features['AHTSOT5PHG_upqrt_AATSOT5PAG_lowqrt'] = 0

up_qrt = train_features['AHTSOT5PHG'].quantile([.75])[0.75]
low_qrt = train_features['AATSOT5PAG'].quantile([.25])[0.25]

up_qrt_mask = train_features['AHTSOT5PHG'] >= up_qrt
up_qrt_mask_idx = up_qrt_mask[up_qrt_mask].index.values
low_qrt_mask = train_features['AATSOT5PAG'] <= low_qrt
low_qrt_mask_idx = low_qrt_mask[low_qrt_mask].index.values

up_qrt_mask_set = set(up_qrt_mask_idx)
common_indices = list(sorted(up_qrt_mask_set.intersection(low_qrt_mask_idx)))

train_features.loc[common_indices, 'AHTSOT5PHG_upqrt_AATSOT5PAG_lowqrt'] = 1

save_train_features(train_features)

In [None]:
# interaction feature.. aatsot5pag upper qrt, ahtsot5phg lower qrt
train_features = load_train_features()

train_features['AATSOT5PAG_upqrt_AHTSOT5PHG_lowqrt'] = 0

up_qrt = train_features['AATSOT5PAG'].quantile([.75])[0.75]
low_qrt = train_features['AHTSOT5PHG'].quantile([.25])[0.25]

up_qrt_mask = train_features['AATSOT5PAG'] >= up_qrt
up_qrt_mask_idx = up_qrt_mask[up_qrt_mask].index.values
low_qrt_mask = train_features['AHTSOT5PHG'] <= low_qrt
low_qrt_mask_idx = low_qrt_mask[low_qrt_mask].index.values

up_qrt_mask_set = set(up_qrt_mask_idx)
common_indices = list(sorted(up_qrt_mask_set.intersection(low_qrt_mask_idx)))

train_features.loc[common_indices, 'AATSOT5PAG_upqrt_AHTSOT5PHG_lowqrt'] = 1

save_train_features(train_features)

In [None]:
# interaction feature.. ahtgc5pg upper qrt, aatgc5pg lower qrt
train_features = load_train_features()

train_features['AHTGC5PG_upqrt_AATGC5PG_lowqrt'] = 0

up_qrt = train_features['AHTGC5PG'].quantile([.75])[0.75]
low_qrt = train_features['AATGC5PG'].quantile([.25])[0.25]

up_qrt_mask = train_features['AHTGC5PG'] >= up_qrt
up_qrt_mask_idx = up_qrt_mask[up_qrt_mask].index.values
low_qrt_mask = train_features['AATGC5PG'] <= low_qrt
low_qrt_mask_idx = low_qrt_mask[low_qrt_mask].index.values

up_qrt_mask_set = set(up_qrt_mask_idx)
common_indices = list(sorted(up_qrt_mask_set.intersection(low_qrt_mask_idx)))

train_features.loc[common_indices, 'AHTGC5PG_upqrt_AATGC5PG_lowqrt'] = 1

save_train_features(train_features)

In [None]:
# interaction feature.. aatgc5pg upper qrt, ahtgc5pg lower qrt
train_features = load_train_features()

train_features['AATGC5PG_upqrt_AHTGC5PG_lowqrt'] = 0

up_qrt = train_features['AATGC5PG'].quantile([.75])[0.75]
low_qrt = train_features['AHTGC5PG'].quantile([.25])[0.25]

up_qrt_mask = train_features['AATGC5PG'] >= up_qrt
up_qrt_mask_idx = up_qrt_mask[up_qrt_mask].index.values
low_qrt_mask = train_features['AHTGC5PG'] <= low_qrt
low_qrt_mask_idx = low_qrt_mask[low_qrt_mask].index.values

up_qrt_mask_set = set(up_qrt_mask_idx)
common_indices = list(sorted(up_qrt_mask_set.intersection(low_qrt_mask_idx)))

train_features.loc[common_indices, 'AATGC5PG_upqrt_AHTGC5PG_lowqrt'] = 1

save_train_features(train_features)
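
The quartile interaction cells above all follow the same pattern: take the 75th percentile of one column, the 25th percentile of another, and flag the rows that satisfy both conditions. A minimal sketch of that pattern as one reusable function is given below. `add_quartile_interaction` is a hypothetical helper and the toy values are made up purely for illustration.

```python
import pandas as pd


def add_quartile_interaction(df, up_col, low_col, new_col):
    """Flag rows where up_col is in its upper quartile and low_col in its lower quartile."""
    up_qrt = df[up_col].quantile(0.75)
    low_qrt = df[low_col].quantile(0.25)
    # combine both boolean masks and store the result as a 0/1 indicator
    df[new_col] = ((df[up_col] >= up_qrt) & (df[low_col] <= low_qrt)).astype('int64')
    return df


# toy data, not taken from the real feature set
toy = pd.DataFrame({'AHTGS5PG': [0.4, 1.0, 1.6, 2.4],
                    'AATGC5PG': [2.0, 1.4, 0.8, 0.2]})
toy = add_quartile_interaction(
    toy, 'AHTGS5PG', 'AATGC5PG', 'AHTGS5PG_upqrt_AATGC5PG_lowqrt')
print(toy)
```

Combining the two masks with `&` gives the same rows as intersecting the two index sets, and a single load and save around a list of column pairs would avoid reloading the features for every new column.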

In [None]:
# awaycapacitydiff_bin below 0.2 quantiles, ahtgs5pg >= 2.2 & ahtgc5pg < 0.8
train_features = load_train_features()
train_features['HTbigcapacitydiff_highgs5pg_lowgc5pg'] = 0

qnt = train_features['AwayCapacityDiff_bin'].quantile(
    [.1, .2, .3, .4, .5, .6, .7, .8, .9])[0.2]
acd_b = train_features[train_features['AwayCapacityDiff_bin'] < qnt]
mask = (acd_b['AHTGS5PG'] >= 2.2) & (acd_b['AHTGC5PG'] < 0.8)
mask_idx = mask[mask].index.values
train_features.loc[mask_idx, 'HTbigcapacitydiff_highgs5pg_lowgc5pg'] = 1

save_train_features(train_features)

In [None]:
# power and quantile normality transformations of continuous features
train_features = load_train_features()

box_cox = ['AHTSOT5PG', 'AATSOT5PG', 'AHTSOT5PHG',
           'AATSOT5PAG', 'AHTP5PG', 'AHTGS_SOT5PHG_ratio']
quant = ['AHTGS5PG', 'AATGS5PG', 'AHTGC5PG', 'AATGC5PG', 'AHTGS5PHG', 'AATGS5PAG', 'AHTGC5PHG', 'AATGC5PAG', 'AHTGS_SOT5PG_ratio',
         'AATGS_SOT5PG_ratio', 'AATGS_SOT5PAG_ratio', 'AHTGD5PG', 'AATGD5PG', 'AHTGD5PHG', 'AATGD5PAG', 'HA_AHTGS5PG_diff',
         'HA_ATP5PG_diff', 'AHT_GS_P5PG_ratio', 'AAT_GS_P5PG_ratio', 'AwayTeamDist', 'AwayCapacityDiff_bin', 'AwayTeamDist_bin',
         'bxcx_AATGS5PG', 'bxcx_AHTGS5PG', 'bxcx_AHTGC5PG', 'bxcx_AHTSOT5PG', 'bxcx_AATSOT5PG', 'bxcx_AHT_GS_P5PG_ratio',
         'bxcx_AAT_GS_P5PG_ratio', 'bxcx_AATGC5PG']

bc_pt = PowerTransformer(method='box-cox')
qt = QuantileTransformer(
    n_quantiles=1000, output_distribution='normal', random_state=1)

for col in train_features:
    if col in box_cox:
        train_features[f'{col}_bxcx_pwrTRANSFORM'] = bc_pt.fit_transform(
            np.array(train_features[col] + 1).reshape(-1, 1))
    elif col in quant:
        train_features[f'{col}_quantileTRANSFORM'] = qt.fit_transform(
            np.array(train_features[col]).reshape(-1, 1))

save_train_features(train_features)

In [None]:
""" all feature engineering above executed and saved """

## Test Features

In [None]:
# separate month from date
test_features = load_test_features()
month = []
for i in range(len(test_features)):
    date = test_features['Date'][i]
    month.append(int(date.split('/')[1]))

test_features['month'] = pd.Series(data=month)

save_test_features(test_features)

In [None]:
# separate year from date
test_features = load_test_features()
year = []
for i in range(len(test_features)):
    date = test_features['Date'][i]
    # two-digit years use %y, four-digit years use %Y
    fmt = '%d/%m/%y' if len(date.split('/')[2]) <= 2 else '%d/%m/%Y'
    # store the year as an int so it compares equal to int(f'20{i}') later on
    year.append(datetime.datetime.strptime(date, fmt).year)

test_features['year'] = pd.Series(data=year)
save_test_features(test_features)

In [None]:
# day of week... 0 = monday, 6 = sunday
test_features = load_test_features()
date = pd.to_datetime(test_features['Date'], dayfirst=True)
test_features['DayofWeek'] = date.dt.dayofweek

save_test_features(test_features)

In [None]:
# season month
# start august
test_features = load_test_features()
test_features['season_month'] = 0
s_mnth_order = [8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7]
count = 0
for mnth in s_mnth_order:
    count += 1
    fm = test_features['month'] == mnth
    fm_idx = fm[fm].index.values
    test_features.loc[fm_idx, 'season_month'] = count

save_test_features(test_features)
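
The season-month numbering above is a fixed lookup from calendar month to position in the season. As a small sketch, it can be built once as a dictionary (`SEASON_MONTH` is a hypothetical name) and then applied to the `month` column with `Series.map`.

```python
# calendar months in season order, starting in August
SEASON_MONTH_ORDER = [8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7]

# calendar month -> position within the season (August = 1, ..., July = 12)
SEASON_MONTH = {month: pos for pos, month in enumerate(SEASON_MONTH_ORDER, start=1)}

print(SEASON_MONTH[8], SEASON_MONTH[12], SEASON_MONTH[1], SEASON_MONTH[7])  # 1 5 6 12
```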

In [None]:
# extract cyclical nature from season month and day of week
# month cyclical
test_features = load_test_features()
test_features['season_month_sin'] = np.sin(
    2*np.pi*test_features.season_month/12)
test_features['season_month_cos'] = np.cos(
    2*np.pi*test_features.season_month/12)
# day of week cyclical
test_features['DayofWeek_sin'] = np.sin(2*np.pi*test_features.DayofWeek/7)
test_features['DayofWeek_cos'] = np.cos(2*np.pi*test_features.DayofWeek/7)

save_test_features(test_features)
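
The reason for the sine/cosine pair is that a plain integer encoding puts the last and first values of a cycle far apart, whereas on the unit circle they are neighbours. The sketch below checks this with the standard library only; `cyclical` and `gap` are hypothetical helper names.

```python
import math


def cyclical(value, period):
    """Project a periodic value onto the unit circle as a (sin, cos) pair."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def gap(a, b, period):
    """Euclidean distance between two encoded values."""
    return math.dist(cyclical(a, period), cyclical(b, period))


# season month 12 (July) and season month 1 (August) end up as neighbours
print(round(gap(12, 1, 12), 4), round(gap(1, 2, 12), 4))
# Sunday (6) and Monday (0) are as close as Monday (0) and Tuesday (1)
print(round(gap(6, 0, 7), 4), round(gap(0, 1, 7), 4))
```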

In [None]:
# average goals scored to shots on target ratio (goals scored / shots on target)
test_features = load_test_features()
# 5 previous games for home team and away team
test_features['AHTGS_SOT5PG_ratio'] = np.round(
    test_features.AHTGS5PG / test_features.AHTSOT5PG, 2)
test_features['AATGS_SOT5PG_ratio'] = np.round(
    test_features.AATGS5PG / test_features.AATSOT5PG, 2)

# 5 previous home games for home team
test_features['AHTGS_SOT5PHG_ratio'] = np.round(
    test_features.AHTGS5PHG / test_features.AHTSOT5PHG, 2)

# 5 previous away games for away team
test_features['AATGS_SOT5PAG_ratio'] = np.round(
    test_features.AATGS5PAG / test_features.AATSOT5PAG, 2)

save_test_features(test_features)

In [None]:
# average goal difference (goals scored - goals conceded)
test_features = load_test_features()
# 5 previous games for home team and away team
test_features['AHTGD5PG'] = test_features.AHTGS5PG - test_features.AHTGC5PG
test_features['AATGD5PG'] = test_features.AATGS5PG - test_features.AATGC5PG

# 5 previous home games for home team
test_features['AHTGD5PHG'] = test_features.AHTGS5PHG - test_features.AHTGC5PHG

# 5 previous away games for away team
test_features['AATGD5PAG'] = test_features.AATGS5PAG - test_features.AATGC5PAG

save_test_features(test_features)

In [None]:
# remove date column
test_features = load_test_features()
test_features.drop(columns=['Date'], inplace=True)
save_test_features(test_features)

In [None]:
# convert division into ordinal numbers
test_features = load_test_features()
# premier league
prem = test_features['Div'] == 'E0'
prem_idx = prem[prem].index.values
test_features.loc[prem_idx, 'Div'] = 2
# championship
champ = test_features['Div'] == 'E1'
champ_idx = champ[champ].index.values
test_features.loc[champ_idx, 'Div'] = 1
# league 1
leag1 = test_features['Div'] == 'E2'
leag1_idx = leag1[leag1].index.values
test_features.loc[leag1_idx, 'Div'] = 0

save_test_features(test_features)

In [None]:
test_features['Div'] = test_features['Div'].astype('int64')
save_test_features(test_features)

In [None]:
# values above/below continuous outlier bounds
test_features = load_test_features()
train_features = load_train_features()

# remove wrong outlier columns from test features
for f in test_features.columns:
    if 'UPoutlier' in f or 'LOWoutlier' in f:
        test_features.drop(columns=[f], inplace=True)

# get correct columns for outlier features from train features
# get outlier features
of = [f for f in train_features.columns if 'UPoutlier' in f or 'LOWoutlier' in f]
of = [f.replace('_UPoutlier', '')
      for f in of]  # remove text to get original feature
of = [f.replace('_LOWoutlier', '')
      for f in of]  # remove text to get original feature
of = [f for f in set(of)]  # get unique features

for key in of:
    # if df has values above upper outlier bound
    if np.sum(test_features[key] > outlier_dict[key][0]):
        # check if the same column exists in train features
        if f'{key}_UPoutlier' in train_features.columns:
            test_features[f'{key}_UPoutlier'] = 0
            mask = test_features[key] > outlier_dict[key][0]
            upper_idx = mask[mask].index.values
            test_features.loc[upper_idx, f'{key}_UPoutlier'] = 1
    # if df has no values above the bound, create a column of 0's
    if not np.sum(test_features[key] > outlier_dict[key][0]):
        if f'{key}_UPoutlier' in train_features.columns:
            test_features[f'{key}_UPoutlier'] = 0
    # if df has values below lower outlier bound
    if np.sum(test_features[key] < outlier_dict[key][1]):
        if f'{key}_LOWoutlier' in train_features.columns:
            test_features[f'{key}_LOWoutlier'] = 0
            mask = test_features[key] < outlier_dict[key][1]
            lower_idx = mask[mask].index.values
            test_features.loc[lower_idx, f'{key}_LOWoutlier'] = 1
    # if df has no values below the bound, create a column of 0's
    if not np.sum(test_features[key] < outlier_dict[key][1]):
        if f'{key}_LOWoutlier' in train_features.columns:
            test_features[f'{key}_LOWoutlier'] = 0
    # if key == 'AATGD5PAG':
    #    break

# drop outliers for features we can't use
# non_feats = ['FTHG_UPoutlier','FTAG_UPoutlier','HS_UPoutlier','AS_UPoutlier','HST_UPoutlier','AST_UPoutlier',
#             'HF_UPoutlier','HF_LOWoutlier','AF_UPoutlier','HC_UPoutlier','AC_UPoutlier','HY_UPoutlier','AY_UPoutlier',
#             'HR_UPoutlier','AR_UPoutlier']
#test_features.drop(columns = non_feats, inplace = True)


# save test_features
save_test_features(test_features)

In [None]:
# home/away team AHTGS5PG difference
test_features = load_test_features()
test_features['HA_AHTGS5PG_diff'] = test_features['AHTGS5PG'] - \
    test_features['AATGS5PG']

# save test_features
save_test_features(test_features)

In [None]:
# home/away team AHTP5PG difference
test_features = load_test_features()
test_features['HA_ATP5PG_diff'] = test_features['AHTP5PG'] - \
    test_features['AATP5PG']

# save test_features
save_test_features(test_features)

In [None]:
# average goals scored to average points ratio
test_features = load_test_features()
#test_features.drop(columns = ['AHT_GS_P5PG_ratio', 'AAT_GS_P5PG_ratio'], inplace = True)
# home team
test_features['AHT_GS_P5PG_ratio'] = (
    test_features['AHTGS5PG'] + 1) / (test_features['AHTP5PG'] + 1)
# away team
test_features['AAT_GS_P5PG_ratio'] = (
    test_features['AATGS5PG'] + 1) / (test_features['AATP5PG'] + 1)

# save test_features
save_test_features(test_features)

In [None]:
# home - away football ground distance
test_features = load_test_features()
test_features['AwayTeamDist'] = np.nan
geolocator = Nominatim(user_agent="football_ground_distance")

team_geo_loc_dict = {}
for team in sorted(test_features['HomeTeam'].value_counts().index):
    t = get_football_ground(team)
    t_loc = geolocator.geocode(t[0])
    t_geo_loc = (t_loc.latitude, t_loc.longitude)
    team_geo_loc_dict[team] = t_geo_loc


for h_team in sorted(test_features['HomeTeam'].value_counts().index):
    away_teams = list(
        set(test_features[(test_features['HomeTeam'] == h_team)]['AwayTeam'].values))
    for a_team in away_teams:
        h_team_loc = team_geo_loc_dict.get(h_team)
        a_team_loc = team_geo_loc_dict.get(a_team)
        a_team_dist = np.round(geodesic(h_team_loc, a_team_loc).miles, 2)
        mask = (test_features['HomeTeam'] == h_team) & (
            test_features['AwayTeam'] == a_team)
        g_idx = mask[mask].index.values
        test_features.loc[g_idx, 'AwayTeamDist'] = a_team_dist

save_test_features(test_features)
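
The cell above relies on geopy's `geodesic`, which measures distance on an ellipsoid, and on a network lookup for each ground. For a quick offline sanity check of the resulting distances, the great-circle (haversine) formula on a sphere is a common approximation. The sketch below is that swapped-in approximation, not what `geodesic` computes; `haversine_miles` is a hypothetical name and the coordinates are approximate city-centre values used only as an illustration.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, spherical approximation


def haversine_miles(loc_a, loc_b):
    """Great-circle distance in miles between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = map(math.radians, loc_a)
    lat2, lon2 = map(math.radians, loc_b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))


# approximate city-centre coordinates, not real stadium locations
london = (51.5074, -0.1278)
manchester = (53.4808, -2.2426)
print(round(haversine_miles(london, manchester), 1))
```

Small differences from the `geodesic` values are expected because of the spherical assumption.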

In [None]:
# home/ away stadium capacity difference
test_features = load_test_features()

test_features['AwayCapacityDiff'] = np.nan

for h_team in sorted(test_features['HomeTeam'].value_counts().index):
    away_teams = list(
        set(test_features[(test_features['HomeTeam'] == h_team)]['AwayTeam'].values))
    for a_team in away_teams:
        htg = get_football_ground(h_team)
        atg = get_football_ground(a_team)
        cap_diff = atg[1] - htg[1]
        mask = (test_features['HomeTeam'] == h_team) & (
            test_features['AwayTeam'] == a_team)
        g_idx = mask[mask].index.values
        test_features.loc[g_idx, 'AwayCapacityDiff'] = cap_diff

save_test_features(test_features)

In [None]:
# bin away ground distance
test_features = load_test_features()
test_features['AwayTeamDist_bin'] = np.floor_divide(
    test_features['AwayTeamDist'], 10)

save_test_features(test_features)

In [None]:
# bin capacity difference
test_features = load_test_features()
test_features['AwayCapacityDiff_bin'] = np.floor_divide(
    test_features['AwayCapacityDiff'], 1000)

save_test_features(test_features)
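
One detail of the binning above: `np.floor_divide` rounds toward negative infinity, as Python's `//` operator does, so negative capacity differences (an away ground smaller than the home ground) land in negative bins rather than being truncated toward zero. A minimal sketch with made-up differences, where `floor_bin` is a hypothetical helper:

```python
def floor_bin(value, width):
    """Bin index of value for bins of the given width, rounding toward negative infinity."""
    return value // width


# made-up capacity differences (away capacity minus home capacity)
for diff in (25000, 1500, 999, 0, -1, -1500):
    print(diff, floor_bin(diff, 1000))
```

A difference of -1 therefore goes to bin -1 and not to bin 0, which keeps "slightly smaller" and "slightly larger" away grounds in separate bins.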

In [None]:
# local derby - distance 10 or below
test_features = load_test_features()
test_features['Local_Derby'] = 0

mask = test_features['AwayTeamDist'] <= 10.0
ld_idx = mask[mask].index.values
test_features.loc[ld_idx, 'Local_Derby'] = 1

save_test_features(test_features)

In [None]:
# distance >= 100
test_features = load_test_features()
test_features['Dist>=100'] = 0

mask = test_features['AwayTeamDist'] >= 100
d_idx = mask[mask].index.values
test_features.loc[d_idx, 'Dist>=100'] = 1

save_test_features(test_features)

In [None]:
# previous end of season position

# load dictionary
dict_file = open('season_dictionary.pkl', 'rb')
season_dict = pickle.load(dict_file)
dict_file.close()
# load test features
test_features = load_test_features()
# create features for new data
test_features['HT_PrevSeasonPos'] = np.nan
test_features['AT_PrevSeasonPos'] = np.nan

for i in range(0, 21):
    if i == 0:
        # look up the previous season's standings; the key format depends on i
        season = season_dict.get(f'99/0{i}')
        first_half = test_features[(test_features['year'] == int(f'200{i}')) & (
            test_features['month'] >= 8)]  # first half season
        second_half = test_features[(test_features['year'] == int(
            f'200{i+1}')) & (test_features['month'] < 8)]  # second half season
    elif 0 < i < 10:
        season = season_dict.get(f'0{i-1}/0{i}')
        first_half = test_features[(test_features['year'] == int(f'200{i}')) & (
            test_features['month'] >= 8)]  # first half season
        if i == 9:
            second_half = test_features[(test_features['year'] == int(
                f'20{i+1}')) & (test_features['month'] < 8)]  # second half season
        else:
            second_half = test_features[(test_features['year'] == int(
                f'200{i+1}')) & (test_features['month'] < 8)]  # second half season
    elif i == 10:
        season = season_dict.get(f'0{i-1}/{i}')
        first_half = test_features[(test_features['year'] == int(f'20{i}')) & (
            test_features['month'] >= 8)]  # first half season
        second_half = test_features[(test_features['year'] == int(
            f'20{i+1}')) & (test_features['month'] < 8)]  # second half season
    else:
        season = season_dict.get(f'{i-1}/{i}')
        first_half = test_features[(test_features['year'] == int(f'20{i}')) & (
            test_features['month'] >= 8)]  # first half season
        second_half = test_features[(test_features['year'] == int(
            f'20{i+1}')) & (test_features['month'] < 8)]  # second half season

    table = season[0][1:]  # prem league but appending all standings into this
    champ = season[1][1:]
    le1 = season[2][1:]  # join each season to make full standings df
    le2 = season[3][1:]
    for l in [champ, le1, le2]:
        for standing in l:
            table.append(standing)
    # DataFrame.append was removed in pandas 2.0, so join with pd.concat
    full_season = pd.concat(
        [first_half, second_half], ignore_index=True)  # first and second joined
    h_teams = sorted(
        full_season['HomeTeam'].value_counts().index)  # home teams
    a_teams = sorted(
        full_season['AwayTeam'].value_counts().index)  # away teams
    for h_team in h_teams:
        #     get masks for team in dictionary from data
        h_first_mask = first_half['HomeTeam'] == h_team
        h_second_mask = second_half['HomeTeam'] == h_team
        # index of home team in data
        h_first_idx = list(h_first_mask[h_first_mask].index.values)
        h_second_idx = list(h_second_mask[h_second_mask].index.values)
        h_mask_idx = sorted(h_first_idx + h_second_idx)
        # get position from team in dictionary into new column
        for position in table:
            if position[1] == 'Wimbledon' and h_team == 'Milton Keynes Dons' and i == 4:
                test_features.loc[h_mask_idx, 'HT_PrevSeasonPos'] = position[0]
            if position[1] == 'Wimbledon' and h_team == 'AFC Wimbledon':
                test_features.loc[h_mask_idx, 'HT_PrevSeasonPos'] = position[0]
            if position[1] == 'Cheltenham Town' and h_team == 'Cheltenham':
                test_features.loc[h_mask_idx, 'HT_PrevSeasonPos'] = position[0]
            if position[1] == h_team:
                # insert prev position if table team = h_team
                test_features.loc[h_mask_idx, 'HT_PrevSeasonPos'] = position[0]
    for a_team in a_teams:
        a_first_mask = first_half['AwayTeam'] == a_team
        a_second_mask = second_half['AwayTeam'] == a_team
        # index of away team in data
        a_first_idx = list(a_first_mask[a_first_mask].index.values)
        a_second_idx = list(a_second_mask[a_second_mask].index.values)
        a_mask_idx = sorted(a_first_idx + a_second_idx)
        for position in table:
            # season before mk dons were formed
            if position[1] == 'Wimbledon' and a_team == 'Milton Keynes Dons' and i == 4:
                test_features.loc[a_mask_idx, 'AT_PrevSeasonPos'] = position[0]
            # wimbledon in dictionary afc wimbledon in data
            if position[1] == 'Wimbledon' and a_team == 'AFC Wimbledon':
                test_features.loc[a_mask_idx, 'AT_PrevSeasonPos'] = position[0]
            # cheltenham town in dict and cheltenham in data
            if position[1] == 'Cheltenham Town' and a_team == 'Cheltenham':
                test_features.loc[a_mask_idx, 'AT_PrevSeasonPos'] = position[0]
            if position[1] == a_team:
                test_features.loc[a_mask_idx, 'AT_PrevSeasonPos'] = position[0]

save_test_features(test_features)

In [None]:
# invert prev season position for ordinality
# max(num_list) + 1 - x for x in num_list
test_features = load_test_features()

# change dtype from float to int
test_features['HT_PrevSeasonPos'] = test_features['HT_PrevSeasonPos'].astype(
    'int64')
test_features['AT_PrevSeasonPos'] = test_features['AT_PrevSeasonPos'].astype(
    'int64')
# insert inv feat
test_features['HT_PrevSeasonPos_inv'] = 0
test_features['AT_PrevSeasonPos_inv'] = 0

# home team prev season
for i in range(1, max(test_features['HT_PrevSeasonPos']) + 1):  # max = 75
    mask = test_features['HT_PrevSeasonPos'] == i
    mask_idx = mask[mask].index.values
    invert_val = max(test_features['HT_PrevSeasonPos']) + 1 - i
    test_features.loc[mask_idx, 'HT_PrevSeasonPos_inv'] = invert_val

for i in range(1, max(test_features['AT_PrevSeasonPos']) + 1):  # max = 75
    mask = test_features['AT_PrevSeasonPos'] == i
    mask_idx = mask[mask].index.values
    invert_val = max(test_features['AT_PrevSeasonPos']) + 1 - i
    test_features.loc[mask_idx, 'AT_PrevSeasonPos_inv'] = invert_val

# drop original columns
test_features.drop(columns=['HT_PrevSeasonPos',
                            'AT_PrevSeasonPos'], inplace=True)
# save features
save_test_features(test_features)
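The inversion in the cell above follows the formula in its comment, `max(num_list) + 1 - x`. A minimal standalone sketch with made-up positions is shown here; `invert_positions` is a hypothetical name used only for illustration, not a helper from this notebook.

```python
def invert_positions(positions):
    """Map rank 1 (best) to the largest value and the worst rank to 1."""
    top = max(positions)
    return [top + 1 - p for p in positions]


# made-up finishing positions across the combined league table
print(invert_positions([1, 2, 75]))  # [75, 74, 1]
```

On a pandas column the same arithmetic can be written as `col.max() + 1 - col`, which would avoid looping over every position value.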

In [None]:
# box-cox power transform
test_features = load_test_features()

# goals scored 5pg
test_features['bxcx_AATGS5PG'] = boxcox(
    test_features['AATGS5PG'] + 1, lmbda=0.1677360888986696)
test_features['bxcx_AHTGS5PG'] = boxcox(
    test_features['AHTGS5PG'] + 1, lmbda=0.15663078083826068)
# goals conceded 5pg
test_features['bxcx_AHTGC5PG'] = boxcox(
    test_features['AHTGC5PG'] + 1, lmbda=0.37450136993128647)
test_features['bxcx_AATGC5PG'] = boxcox(
    test_features['AATGC5PG'] + 1, lmbda=0.3435345631007946)
# shots on target 5pg
test_features['bxcx_AHTSOT5PG'] = boxcox(
    test_features['AHTSOT5PG'] + 1, lmbda=0.13473011722266465)
test_features['bxcx_AATSOT5PG'] = boxcox(
    test_features['AATSOT5PG'] + 1, lmbda=0.15580637160487532)
# goals scored/point 5pg ratio
test_features['bxcx_AHT_GS_P5PG_ratio'] = boxcox(
    test_features['AHT_GS_P5PG_ratio'] + 1, lmbda=-0.4193818135094843)
test_features['bxcx_AAT_GS_P5PG_ratio'] = boxcox(
    test_features['AAT_GS_P5PG_ratio'] + 1, lmbda=-0.42013380790196936)

save_test_features(test_features)
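When `lmbda` is fixed, as in the cell above, the Box-Cox transform reduces to a closed-form expression: `(x**lmbda - 1) / lmbda`, or `log(x)` when `lmbda` is 0. The sketch below is a plain-Python illustration of that formula, not the scipy implementation; `box_cox` is a hypothetical name. The lambda values in the notebook were presumably estimated on the training data, which is an assumption because the fitting cell is not part of this section.

```python
import math


def box_cox(x, lmbda):
    """Box-Cox transform of a single positive value with a fixed lambda."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive input")
    if lmbda == 0:
        return math.log(x)
    return (x ** lmbda - 1) / lmbda


print(box_cox(4.0, 0.5))                 # 2.0
# the '+ 1' shift in the notebook maps a raw value of 0 to 1, which transforms to 0
print(box_cox(0 + 1, 0.1677360888986696))  # 0.0
```

The `+ 1` shift matters because per-game averages can be exactly zero, and Box-Cox is undefined for non-positive values.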

In [None]:
# upper/ lower quartiles

test_features = load_test_features()
train_features = load_train_features()

# difference between home/away goals scored 5pg.. upper/lower quartiles
test_features['HA_AHTGS5PG_diff_upqrt'] = 0
test_features['HA_AHTGS5PG_diff_lowqrt'] = 0
# get upper/lower quantiles from train data
up_qrt = train_features['HA_AHTGS5PG_diff'].quantile([.25, .5, .75])[0.75]
low_qrt = train_features['HA_AHTGS5PG_diff'].quantile([.25, .5, .75])[0.25]

up_mask = test_features['HA_AHTGS5PG_diff'] >= up_qrt
low_mask = test_features['HA_AHTGS5PG_diff'] <= low_qrt

up_idx = up_mask[up_mask].index.values
low_idx = low_mask[low_mask].index.values

test_features.loc[up_idx, 'HA_AHTGS5PG_diff_upqrt'] = 1
test_features.loc[low_idx, 'HA_AHTGS5PG_diff_lowqrt'] = 1

# average goals conceded 5pg.. upper/lower quartiles

test_features['AHTGC5PG_upqrt'] = 0
test_features['AHTGC5PG_lowqrt'] = 0
test_features['AATGC5PG_upqrt'] = 0
test_features['AATGC5PG_lowqrt'] = 0
# get upper/lower quantiles from train data
h_up_qrt = train_features['AHTGC5PG'].quantile([.25, .5, .75])[0.75]
h_low_qrt = train_features['AHTGC5PG'].quantile([.25, .5, .75])[0.25]
a_up_qrt = train_features['AATGC5PG'].quantile([.25, .5, .75])[0.75]
a_low_qrt = train_features['AATGC5PG'].quantile([.25, .5, .75])[0.25]

h_up_mask = test_features['AHTGC5PG'] >= h_up_qrt
h_low_mask = test_features['AHTGC5PG'] <= h_low_qrt
a_up_mask = test_features['AATGC5PG'] >= a_up_qrt
a_low_mask = test_features['AATGC5PG'] <= a_low_qrt

h_up_idx = h_up_mask[h_up_mask].index.values
h_low_idx = h_low_mask[h_low_mask].index.values
a_up_idx = a_up_mask[a_up_mask].index.values
a_low_idx = a_low_mask[a_low_mask].index.values

test_features.loc[h_up_idx, 'AHTGC5PG_upqrt'] = 1
test_features.loc[h_low_idx, 'AHTGC5PG_lowqrt'] = 1
test_features.loc[a_up_idx, 'AATGC5PG_upqrt'] = 1
test_features.loc[a_low_idx, 'AATGC5PG_lowqrt'] = 1

save_test_features(test_features)

In [None]:
# interaction feature.. ahtgs5pg upper quartile, ahtgc5pg lower quartile
test_features = load_test_features()
train_features = load_train_features()

test_features['HT_GSGC_UPLOW_QRT'] = 0
# get upper/lower quantiles from train data
up_qrt = train_features['AHTGS5PG'].quantile([.75])[0.75]
low_qrt = train_features['AHTGC5PG'].quantile([.25])[0.25]

up_qrt_mask = test_features['AHTGS5PG'] >= up_qrt
up_qrt_mask_idx = up_qrt_mask[up_qrt_mask].index.values
low_qrt_mask = test_features['AHTGC5PG'] <= low_qrt
low_qrt_mask_idx = low_qrt_mask[low_qrt_mask].index.values

up_qrt_mask_set = set(up_qrt_mask_idx)
common_indices = list(sorted(up_qrt_mask_set.intersection(low_qrt_mask_idx)))

test_features.loc[common_indices, 'HT_GSGC_UPLOW_QRT'] = 1

save_test_features(test_features)

In [None]:
# interaction feature.. aatgs5pg upper quartile, aatgc5pg lower quartile
test_features = load_test_features()
train_features = load_train_features()

test_features['AT_GSGC_UPLOW_QRT'] = 0
# get upper/lower quantiles from train data
up_qrt = train_features['AATGS5PG'].quantile([.75])[0.75]
low_qrt = train_features['AATGC5PG'].quantile([.25])[0.25]

up_qrt_mask = test_features['AATGS5PG'] >= up_qrt
up_qrt_mask_idx = up_qrt_mask[up_qrt_mask].index.values
low_qrt_mask = test_features['AATGC5PG'] <= low_qrt
low_qrt_mask_idx = low_qrt_mask[low_qrt_mask].index.values

up_qrt_mask_set = set(up_qrt_mask_idx)
common_indices = list(sorted(up_qrt_mask_set.intersection(low_qrt_mask_idx)))

test_features.loc[common_indices, 'AT_GSGC_UPLOW_QRT'] = 1

save_test_features(test_features)

In [None]:
# interaction feature.. ahtgs5pg upper qrt, aatgc5pg lower qrt
test_features = load_test_features()
train_features = load_train_features()

test_features['AHTGS5PG_upqrt_AATGC5PG_lowqrt'] = 0

up_qrt = train_features['AHTGS5PG'].quantile([.75])[0.75]
low_qrt = train_features['AATGC5PG'].quantile([.25])[0.25]

up_qrt_mask = test_features['AHTGS5PG'] >= up_qrt
up_qrt_mask_idx = up_qrt_mask[up_qrt_mask].index.values
low_qrt_mask = test_features['AATGC5PG'] <= low_qrt
low_qrt_mask_idx = low_qrt_mask[low_qrt_mask].index.values

up_qrt_mask_set = set(up_qrt_mask_idx)
common_indices = list(sorted(up_qrt_mask_set.intersection(low_qrt_mask_idx)))

test_features.loc[common_indices, 'AHTGS5PG_upqrt_AATGC5PG_lowqrt'] = 1

save_test_features(test_features)

In [None]:
# interaction feature.. aatgs5pg upper qrt, ahtgc5pg lower qrt
test_features = load_test_features()
train_features = load_train_features()

test_features['AATGS5PG_upqrt_AHTGC5PG_lowqrt'] = 0
# get upper/lower quantiles from train data
up_qrt = train_features['AATGS5PG'].quantile([.75])[0.75]
low_qrt = train_features['AHTGC5PG'].quantile([.25])[0.25]

up_qrt_mask = test_features['AATGS5PG'] >= up_qrt
up_qrt_mask_idx = up_qrt_mask[up_qrt_mask].index.values
low_qrt_mask = test_features['AHTGC5PG'] <= low_qrt
low_qrt_mask_idx = low_qrt_mask[low_qrt_mask].index.values

up_qrt_mask_set = set(up_qrt_mask_idx)
common_indices = list(sorted(up_qrt_mask_set.intersection(low_qrt_mask_idx)))

test_features.loc[common_indices, 'AATGS5PG_upqrt_AHTGC5PG_lowqrt'] = 1

save_test_features(test_features)

In [None]:
# interaction feature.. ahtgs5pg upper qrt, aatgs5pg lower qrt
test_features = load_test_features()
train_features = load_train_features()

test_features['AHTGS5PG_upqrt_AATGS5PG_lowqrt'] = 0
# get upper/lower quantiles from train data
up_qrt = train_features['AHTGS5PG'].quantile([.75])[0.75]
low_qrt = train_features['AATGS5PG'].quantile([.25])[0.25]

up_qrt_mask = test_features['AHTGS5PG'] >= up_qrt
up_qrt_mask_idx = up_qrt_mask[up_qrt_mask].index.values
low_qrt_mask = test_features['AATGS5PG'] <= low_qrt
low_qrt_mask_idx = low_qrt_mask[low_qrt_mask].index.values

up_qrt_mask_set = set(up_qrt_mask_idx)
common_indices = list(sorted(up_qrt_mask_set.intersection(low_qrt_mask_idx)))

test_features.loc[common_indices, 'AHTGS5PG_upqrt_AATGS5PG_lowqrt'] = 1

save_test_features(test_features)

In [None]:
# interaction feature.. aatgs5pg upper qrt, ahtgs5pg lower qrt
test_features = load_test_features()
train_features = load_train_features()

test_features['AATGS5PG_upqrt_AHTGS5PG_lowqrt'] = 0
# get upper/lower quantiles from train data
up_qrt = train_features['AATGS5PG'].quantile([.75])[0.75]
low_qrt = train_features['AHTGS5PG'].quantile([.25])[0.25]

up_qrt_mask = test_features['AATGS5PG'] >= up_qrt
up_qrt_mask_idx = up_qrt_mask[up_qrt_mask].index.values
low_qrt_mask = test_features['AHTGS5PG'] <= low_qrt
low_qrt_mask_idx = low_qrt_mask[low_qrt_mask].index.values

up_qrt_mask_set = set(up_qrt_mask_idx)
common_indices = list(sorted(up_qrt_mask_set.intersection(low_qrt_mask_idx)))

test_features.loc[common_indices, 'AATGS5PG_upqrt_AHTGS5PG_lowqrt'] = 1

save_test_features(test_features)

In [None]:
# interaction feature.. ahtgs5phg upper qrt, aatgs5pag lower qrt
test_features = load_test_features()
train_features = load_train_features()

test_features['AHTGS5PHG_upqrt_AATGS5PAG_lowqrt'] = 0
# get upper/lower quantiles from train data
up_qrt = train_features['AHTGS5PHG'].quantile([.75])[0.75]
low_qrt = train_features['AATGS5PAG'].quantile([.25])[0.25]

up_qrt_mask = test_features['AHTGS5PHG'] >= up_qrt
up_qrt_mask_idx = up_qrt_mask[up_qrt_mask].index.values
low_qrt_mask = test_features['AATGS5PAG'] <= low_qrt
low_qrt_mask_idx = low_qrt_mask[low_qrt_mask].index.values

up_qrt_mask_set = set(up_qrt_mask_idx)
common_indices = list(sorted(up_qrt_mask_set.intersection(low_qrt_mask_idx)))

test_features.loc[common_indices, 'AHTGS5PHG_upqrt_AATGS5PAG_lowqrt'] = 1

save_test_features(test_features)

In [None]:
# interaction feature.. aatgs5pag upper qrt, ahtgs5phg lower qrt
test_features = load_test_features()
train_features = load_train_features()

test_features['AATGS5PAG_upqrt_AHTGS5PHG_lowqrt'] = 0
# get upper/lower quantiles from train data
up_qrt = train_features['AATGS5PAG'].quantile([.75])[0.75]
low_qrt = train_features['AHTGS5PHG'].quantile([.25])[0.25]

up_qrt_mask = test_features['AATGS5PAG'] >= up_qrt
up_qrt_mask_idx = up_qrt_mask[up_qrt_mask].index.values
low_qrt_mask = test_features['AHTGS5PHG'] <= low_qrt
low_qrt_mask_idx = low_qrt_mask[low_qrt_mask].index.values

up_qrt_mask_set = set(up_qrt_mask_idx)
common_indices = list(sorted(up_qrt_mask_set.intersection(low_qrt_mask_idx)))

test_features.loc[common_indices, 'AATGS5PAG_upqrt_AHTGS5PHG_lowqrt'] = 1

save_test_features(test_features)

In [None]:
# interaction feature.. ahtsot5pg upper qrt, aatsot5pg lower qrt
test_features = load_test_features()
train_features = load_train_features()

test_features['AHTSOT5PG_upqrt_AATSOT5PG_lowqrt'] = 0
# get upper/lower quantiles from train data
up_qrt = train_features['AHTSOT5PG'].quantile([.75])[0.75]
low_qrt = train_features['AATSOT5PG'].quantile([.25])[0.25]

up_qrt_mask = test_features['AHTSOT5PG'] >= up_qrt
up_qrt_mask_idx = up_qrt_mask[up_qrt_mask].index.values
low_qrt_mask = test_features['AATSOT5PG'] <= low_qrt
low_qrt_mask_idx = low_qrt_mask[low_qrt_mask].index.values

up_qrt_mask_set = set(up_qrt_mask_idx)
common_indices = list(sorted(up_qrt_mask_set.intersection(low_qrt_mask_idx)))

test_features.loc[common_indices, 'AHTSOT5PG_upqrt_AATSOT5PG_lowqrt'] = 1

save_test_features(test_features)

In [None]:
# interaction feature.. aatsot5pg upper qrt, ahtsot5pg lower qrt
test_features = load_test_features()
train_features = load_train_features()

test_features['AATSOT5PG_upqrt_AHTSOT5PG_lowqrt'] = 0
# get upper/lower quantiles from train data
up_qrt = train_features['AATSOT5PG'].quantile([.75])[0.75]
low_qrt = train_features['AHTSOT5PG'].quantile([.25])[0.25]

up_qrt_mask = test_features['AATSOT5PG'] >= up_qrt
up_qrt_mask_idx = up_qrt_mask[up_qrt_mask].index.values
low_qrt_mask = test_features['AHTSOT5PG'] <= low_qrt
low_qrt_mask_idx = low_qrt_mask[low_qrt_mask].index.values

up_qrt_mask_set = set(up_qrt_mask_idx)
common_indices = list(sorted(up_qrt_mask_set.intersection(low_qrt_mask_idx)))

test_features.loc[common_indices, 'AATSOT5PG_upqrt_AHTSOT5PG_lowqrt'] = 1

save_test_features(test_features)

In [None]:
# interaction feature.. ahtsot5phg upper qrt, aatsot5pag lower qrt
test_features = load_test_features()
train_features = load_train_features()

test_features['AHTSOT5PHG_upqrt_AATSOT5PAG_lowqrt'] = 0
# get upper/lower quantiles from train data
up_qrt = train_features['AHTSOT5PHG'].quantile([.75])[0.75]
low_qrt = train_features['AATSOT5PAG'].quantile([.25])[0.25]

up_qrt_mask = test_features['AHTSOT5PHG'] >= up_qrt
up_qrt_mask_idx = up_qrt_mask[up_qrt_mask].index.values
low_qrt_mask = test_features['AATSOT5PAG'] <= low_qrt
low_qrt_mask_idx = low_qrt_mask[low_qrt_mask].index.values

up_qrt_mask_set = set(up_qrt_mask_idx)
common_indices = list(sorted(up_qrt_mask_set.intersection(low_qrt_mask_idx)))

test_features.loc[common_indices, 'AHTSOT5PHG_upqrt_AATSOT5PAG_lowqrt'] = 1

save_test_features(test_features)

In [None]:
# interaction feature.. aatsot5pag upper qrt, ahtsot5phg lower qrt
test_features = load_test_features()
train_features = load_train_features()

test_features['AATSOT5PAG_upqrt_AHTSOT5PHG_lowqrt'] = 0
# get upper/lower quantiles from train data
up_qrt = train_features['AATSOT5PAG'].quantile([.75])[0.75]
low_qrt = train_features['AHTSOT5PHG'].quantile([.25])[0.25]

up_qrt_mask = test_features['AATSOT5PAG'] >= up_qrt
up_qrt_mask_idx = up_qrt_mask[up_qrt_mask].index.values
low_qrt_mask = test_features['AHTSOT5PHG'] <= low_qrt
low_qrt_mask_idx = low_qrt_mask[low_qrt_mask].index.values

up_qrt_mask_set = set(up_qrt_mask_idx)
common_indices = list(sorted(up_qrt_mask_set.intersection(low_qrt_mask_idx)))

test_features.loc[common_indices, 'AATSOT5PAG_upqrt_AHTSOT5PHG_lowqrt'] = 1

save_test_features(test_features)

In [None]:
# interaction feature.. ahtgc5pg upper qrt, aatgc5pg lower qrt
test_features = load_test_features()
train_features = load_train_features()

test_features['AHTGC5PG_upqrt_AATGC5PG_lowqrt'] = 0
# get upper/lower quantiles from train data
up_qrt = train_features['AHTGC5PG'].quantile([.75])[0.75]
low_qrt = train_features['AATGC5PG'].quantile([.25])[0.25]

up_qrt_mask = test_features['AHTGC5PG'] >= up_qrt
up_qrt_mask_idx = up_qrt_mask[up_qrt_mask].index.values
low_qrt_mask = test_features['AATGC5PG'] <= low_qrt
low_qrt_mask_idx = low_qrt_mask[low_qrt_mask].index.values

up_qrt_mask_set = set(up_qrt_mask_idx)
common_indices = list(sorted(up_qrt_mask_set.intersection(low_qrt_mask_idx)))

test_features.loc[common_indices, 'AHTGC5PG_upqrt_AATGC5PG_lowqrt'] = 1

save_test_features(test_features)

In [None]:
# interaction feature.. aatgc5pg upper qrt, ahtgc5pg lower qrt
test_features = load_test_features()
train_features = load_train_features()

test_features['AATGC5PG_upqrt_AHTGC5PG_lowqrt'] = 0
# get upper/lower quantiles from train data
up_qrt = train_features['AATGC5PG'].quantile([.75])[0.75]
low_qrt = train_features['AHTGC5PG'].quantile([.25])[0.25]

up_qrt_mask = test_features['AATGC5PG'] >= up_qrt
up_qrt_mask_idx = up_qrt_mask[up_qrt_mask].index.values
low_qrt_mask = test_features['AHTGC5PG'] <= low_qrt
low_qrt_mask_idx = low_qrt_mask[low_qrt_mask].index.values

up_qrt_mask_set = set(up_qrt_mask_idx)
common_indices = list(sorted(up_qrt_mask_set.intersection(low_qrt_mask_idx)))

test_features.loc[common_indices, 'AATGC5PG_upqrt_AHTGC5PG_lowqrt'] = 1

save_test_features(test_features)

In [None]:
# awaycapacitydiff_bin below 0.2 quantiles, ahtgs5pg >= 2.2 & ahtgc5pg < 0.8
test_features = load_test_features()
train_features = load_train_features()
test_features['HTbigcapacitydiff_highgs5pg_lowgc5pg'] = 0
# get lower quantile from train data
qnt = train_features['AwayCapacityDiff_bin'].quantile(
    [.1, .2, .3, .4, .5, .6, .7, .8, .9])[0.2]
acd_b = test_features[test_features['AwayCapacityDiff_bin'] < qnt]
mask = (acd_b['AHTGS5PG'] >= 2.2) & (acd_b['AHTGC5PG'] < 0.8)
mask_idx = mask[mask].index.values
test_features.loc[mask_idx, 'HTbigcapacitydiff_highgs5pg_lowgc5pg'] = 1

save_test_features(test_features)

In [None]:
# power and quantile normality transformations of continuous features
test_features = load_test_features()
train_features = load_train_features()
# drop columns from previous execution
for col in test_features.columns:
    if 'TRANSFORM' in col:
        test_features.drop(columns=[col], inplace=True)

box_cox = ['AHTSOT5PG', 'AATSOT5PG', 'AHTSOT5PHG',
           'AATSOT5PAG', 'AHTP5PG', 'AHTGS_SOT5PHG_ratio']
quant = ['AHTGS5PG', 'AATGS5PG', 'AHTGC5PG', 'AATGC5PG', 'AHTGS5PHG', 'AATGS5PAG', 'AHTGC5PHG', 'AATGC5PAG', 'AHTGS_SOT5PG_ratio',
         'AATGS_SOT5PG_ratio', 'AATGS_SOT5PAG_ratio', 'AHTGD5PG', 'AATGD5PG', 'AHTGD5PHG', 'AATGD5PAG', 'HA_AHTGS5PG_diff',
         'HA_ATP5PG_diff', 'AHT_GS_P5PG_ratio', 'AAT_GS_P5PG_ratio', 'AwayTeamDist', 'AwayCapacityDiff_bin', 'AwayTeamDist_bin',
         'bxcx_AATGS5PG', 'bxcx_AHTGS5PG', 'bxcx_AHTGC5PG', 'bxcx_AHTSOT5PG', 'bxcx_AATSOT5PG', 'bxcx_AHT_GS_P5PG_ratio',
         'bxcx_AAT_GS_P5PG_ratio', 'bxcx_AATGC5PG']

for col in test_features:
    bc_pt = PowerTransformer(method='box-cox')
    qt = QuantileTransformer(
        n_quantiles=1000, output_distribution='normal', random_state=1)
    if col in box_cox:
        # fit on train data, transform test data
        bc_pt.fit(np.array(train_features[col] + 1).reshape(-1, 1))
        test_features[f'{col}_bxcx_pwrTRANSFORM'] = bc_pt.transform(
            np.array(test_features[col] + 1).reshape(-1, 1))
    elif col in quant:
        # fit on train data, transform test data
        qt.fit(np.array(train_features[col]).reshape(-1, 1))
        test_features[f'{col}_quantileTRANSFORM'] = qt.transform(
            np.array(test_features[col]).reshape(-1, 1))

save_test_features(test_features)

In [None]:
""" all feature engineering above executed and saved """

## - Summary -

Types of features engineered:

date extraction (days, months, years), cyclical encodings (days, months, years), ratios, differences, conversions, indicators for upper and lower outliers, binned features, ordinality inversion, indicators for less than or greater than a threshold, box-cox power transforms, upper/lower quartile indicators, feature interactions, quantile indicators and normality transforms.
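The upper/lower quartile interaction cells above all repeat the same steps: take the 0.75 and 0.25 quantiles from the training data, then flag the test rows that satisfy both conditions. A hedged sketch of how that pattern could be wrapped in one function is shown here. `add_quartile_interaction` and the toy data are hypothetical and are not part of the notebook.

```python
import pandas as pd


def add_quartile_interaction(train, test, up_col, low_col, name):
    """Flag test rows where up_col is in the train upper quartile
    and low_col is in the train lower quartile."""
    # thresholds come from the training data only, as in the cells above
    up_qrt = train[up_col].quantile(0.75)
    low_qrt = train[low_col].quantile(0.25)
    mask = (test[up_col] >= up_qrt) & (test[low_col] <= low_qrt)
    test[name] = mask.astype(int)
    return test


# toy data, made up for illustration
train = pd.DataFrame({'AHTGS5PG': [0.0, 1.0, 2.0, 3.0, 4.0],
                      'AATGC5PG': [0.0, 1.0, 2.0, 3.0, 4.0]})
test = pd.DataFrame({'AHTGS5PG': [3.5, 3.5, 0.5],
                     'AATGC5PG': [0.5, 2.5, 0.5]})
test = add_quartile_interaction(train, test, 'AHTGS5PG', 'AATGC5PG',
                                'AHTGS5PG_upqrt_AATGC5PG_lowqrt')
print(test['AHTGS5PG_upqrt_AATGC5PG_lowqrt'].tolist())  # [1, 0, 0]
```

Combining the two boolean masks with `&` gives the same rows as intersecting the two index sets, so each interaction feature could be produced by a single call.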

# -- Baseline --

In [None]:
# home baseline test
home_train_features = load_home_train_features_with_drop()
home_train_features = football_data_team_ohe(home_train_features)
sc = StandardScaler()
home_train_features = sc.fit_transform(home_train_features)
home_train_target, _ = load_home_targets()
model = DummyClassifier(strategy='constant', constant=0)
scores = evaluate_model(home_train_features,
                        home_train_target, model, k_splits=10)
print(f'Home baseline Fbeta Score: {np.mean(scores)}, Std: {np.std(scores)}')

In [None]:
# away baseline test
away_train_features = load_away_train_features_with_drop()
away_train_features = football_data_team_ohe(away_train_features)
sc = StandardScaler()
away_train_features = sc.fit_transform(away_train_features)
away_train_target, _ = load_away_targets()
model = DummyClassifier(strategy='constant', constant=0)
scores = evaluate_model(away_train_features,
                        away_train_target, model, k_splits=10)
print(f'Away baseline Fbeta Score: {np.mean(scores)}, Std: {np.std(scores)}')

In [None]:
# home data simple model
home_train_features = load_home_train_features_with_drop()
home_best_feat = home_train_features['AwayCapacityDiff']
home_train_target, _ = load_home_targets()
model = LogisticRegression(class_weight='balanced')
scores = evaluate_model(np.array(home_best_feat).reshape(-1, 1),
                        home_train_target, model, k_splits=10)
print(
    f'Home simple model avg Fbeta score: {np.mean(scores)}, Std: {np.std(scores)}')

In [None]:
# away data simple model
away_train_features = load_away_train_features_with_drop()
away_best_feat = away_train_features['AwayCapacityDiff']
away_train_target, _ = load_away_targets()
model = LogisticRegression(class_weight='balanced')
scores = evaluate_model(np.array(away_best_feat).reshape(-1, 1),
                        away_train_target, model, k_splits=10)
print(
    f'Away simple model avg Fbeta score: {np.mean(scores)}, Std: {np.std(scores)}')

## - Summary -

The Fbeta metric is set up to give slightly more weight to recall (minimising false negatives). In this domain that makes the models more risk-accepting: false negatives decrease at the cost of more false positives. In plain English, the models make fewer "no win" predictions for matches that were actually wins, with the unavoidable trade-off of more "win" predictions for matches that were actually no wins, which means accepting more risk. Maximising precision (minimising false positives) would instead make the models more conservative: fewer "win" predictions for matches that were actually no wins, with the unavoidable trade-off of more "no win" predictions for matches that were actually wins.

Fbeta for the models being tested is weighted 0.2 above the harmonic mean. Altering this weighting could produce better or worse performing models for this domain and should be investigated when rerunning the analysis.
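To make the recall weighting concrete, the sketch below computes Fbeta directly from precision and recall. It assumes that "weighted 0.2 above the harmonic mean" corresponds to beta = 1.2, which is a reading of the text rather than something confirmed by the notebook's own `fbeta` function. `fbeta_from_pr` is a hypothetical name and the precision/recall values are made up.

```python
def fbeta_from_pr(precision, recall, beta):
    """F-beta score from precision and recall; beta > 1 favours recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


BETA = 1.2  # assumption: 0.2 above the harmonic mean (beta = 1)

# two made-up models with mirrored precision and recall
recall_heavy = fbeta_from_pr(precision=0.5, recall=0.8, beta=BETA)
precision_heavy = fbeta_from_pr(precision=0.8, recall=0.5, beta=BETA)

print(round(recall_heavy, 4), round(precision_heavy, 4))
# with beta = 1 the two models would score the same
print(fbeta_from_pr(0.5, 0.8, 1.0) == fbeta_from_pr(0.8, 0.5, 1.0))
```

With beta above 1 the recall-heavy model scores higher, which is the risk-accepting behaviour described above.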

Home and away Fbeta score baselines when always predicting the majority class (no win):

    Home fbeta baseline: 0.4183104482317975
    Away fbeta baseline: 0.6102523359914978
    
Home and away simple model fbeta scores, where only the best feature is used with a logistic regression model:

    Home simple model Fbeta score: 0.555328458299842
    Away simple model Fbeta score: 0.5644830306923153
    
The home model's performance increases substantially over its baseline when only the top feature is used, but the away model's performance decreases somewhat. This decrease could be due to the larger class imbalance in the away data, with only approx. 29% away wins compared to approx. 45% home wins. It could also be due to the away simple model predicting more false positives (predicting an away win when the result was actually an away no win, i.e. a loss or draw). This can be alleviated with oversampling and undersampling methods.
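As an illustration of the oversampling idea, the sketch below randomly duplicates minority-class rows until both classes are the same size. It is a plain-Python toy and not the resampling method used later in the notebook; `random_oversample` is a hypothetical name, and the 29/71 split is made up to mirror the approximate away-win proportion.

```python
import random


def random_oversample(rows, labels, seed=1):
    """Duplicate randomly chosen minority rows until the classes are balanced."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    # sample minority indices with replacement to fill the gap
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    keep = list(range(len(labels))) + extra
    return [rows[i] for i in keep], [labels[i] for i in keep]


# toy data: 29 wins (class 1) and 71 no wins (class 0)
labels = [1] * 29 + [0] * 71
rows = [[float(i)] for i in range(100)]

X_bal, y_bal = random_oversample(rows, labels)
print(sum(y_bal), len(y_bal) - sum(y_bal))  # 71 71
```

Any resampling of this kind should be applied to the training folds only, so that the validation folds keep the original class proportions.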

# -- Modelling --

## Home Features

In [None]:
# home model test with home/away teams
home_train_features = load_home_train_features_with_drop()
home_train_features = football_data_team_ohe(home_train_features)
home_train_target, _ = load_home_targets()
sc = StandardScaler()

ml_models, ml_names = get_ml_models()

model_list = []
score_list = []
std_list = []

for i in range(len(ml_models)):
    pipe = Pipeline(
        steps=[
            ('sc', sc),
            ('model', ml_models[i])
        ]
    )
    scores = evaluate_model(home_train_features,
                            home_train_target, pipe, k_splits=10)
    model_list.append(ml_names[i])
    score_list.append(np.mean(scores))
    std_list.append(np.std(scores))
model_performance = pd.DataFrame(
    {'Model': model_list, 'Avg Fbeta': score_list, 'Std': std_list})
model_performance.sort_values(
    by='Avg Fbeta', ascending=False).reset_index(drop=True)

In [None]:
# home model test without home/away teams
home_train_features = load_home_train_features_with_drop()
home_train_features.drop(columns=['HomeTeam', 'AwayTeam'], inplace=True)
home_train_target, _ = load_home_targets()
sc = StandardScaler(with_mean=False)

ml_models, ml_names = get_ml_models()

model_list = []
score_list = []
std_list = []

for i in range(len(ml_models)):
    print(f'Running {ml_names[i]} Model')
    pipe = Pipeline(
        steps=[
            ('sc', sc),
            ('model', ml_models[i])
        ]
    )
    scores = evaluate_model(home_train_features, np.array(
        home_train_target).reshape(-1,), pipe, k_splits=10)
    model_list.append(ml_names[i])
    score_list.append(np.mean(scores))
    std_list.append(np.std(scores))
model_performance = pd.DataFrame(
    {'Model': model_list, 'Avg Fbeta': score_list, 'Std': std_list})
model_performance.sort_values(
    by='Avg Fbeta', ascending=False).reset_index(drop=True)

In [None]:
# hypothesis test for ml models with home data
home_train_features = load_home_train_features_with_drop()
home_train_features.drop(columns=['HomeTeam', 'AwayTeam'], inplace=True)
home_train_target, _ = load_home_targets()
sc = StandardScaler(with_mean=False)
# scale the features and keep the result for the test below
home_train_features = sc.fit_transform(home_train_features)

fbeta_metric = make_scorer(fbeta)
mod_names = ['GNB', 'GB', 'AB', 'XGB', 'RF']
ml_models, ml_names = get_ml_models()


for i in range(len(ml_models)):
    if ml_names[i] == 'LR':
        for j in range(len(ml_models)):
            if ml_names[j] in mod_names:

                t, p = paired_ttest_5x2cv(
                    estimator1=ml_models[i],
                    estimator2=ml_models[j],
                    X=home_train_features,
                    y=np.array(home_train_target).reshape(-1,),
                    scoring=fbeta_metric,
                    random_seed=1
                )
                if p <= 0.05:
                    print(
                        f'{ml_names[i]} & {ml_names[j]}: Difference in model performance is most likely real')
                else:
                    print(
                        f'{ml_names[i]} & {ml_names[j]}: Both models are most likely to have the same performance')

In [None]:
# home model test without home/away teams and transformed continuous data
home_train_features = load_home_train_features_with_drop_transformed()
home_train_features.drop(columns=['HomeTeam', 'AwayTeam'], inplace=True)
home_train_target, _ = load_home_targets()
sc = StandardScaler(with_mean=False)

ml_models, ml_names = get_ml_models()

model_list = []
score_list = []
std_list = []

for i in range(len(ml_models)):
    print(f'Running {ml_names[i]} Model')
    pipe = Pipeline(
        steps=[
            ('sc', sc),
            ('model', ml_models[i])
        ]
    )
    scores = evaluate_model(home_train_features, np.array(
        home_train_target).reshape(-1,), pipe, k_splits=10)
    model_list.append(ml_names[i])
    score_list.append(np.mean(scores))
    std_list.append(np.std(scores))
model_performance = pd.DataFrame(
    {'Model': model_list, 'Avg Fbeta': score_list, 'Std': std_list})
model_performance.sort_values(
    by='Avg Fbeta', ascending=False).reset_index(drop=True)

In [None]:
# hypothesis test for ml models with home data transformed
home_train_features = load_home_train_features_with_drop_transformed()
home_train_features.drop(columns=['HomeTeam', 'AwayTeam'], inplace=True)
home_train_target, _ = load_home_targets()
sc = StandardScaler(with_mean=False)
# scale the features and keep the result for the test below
home_train_features = sc.fit_transform(home_train_features)

fbeta_metric = make_scorer(fbeta)
mod_names = ['GNB', 'GB', 'AB', 'XGB', 'RF']
ml_models, ml_names = get_ml_models()


for i in range(len(ml_models)):
    if ml_names[i] == 'LR':
        for j in range(len(ml_models)):
            if ml_names[j] in mod_names:

                t, p = paired_ttest_5x2cv(
                    estimator1=ml_models[i],
                    estimator2=ml_models[j],
                    X=home_train_features,
                    y=np.array(home_train_target).reshape(-1,),
                    scoring=fbeta_metric,
                    random_seed=1
                )
                if p <= 0.05:
                    print(
                        f'{ml_names[i]} & {ml_names[j]}: Difference in model performance is most likely real')
                else:
                    print(
                        f'{ml_names[i]} & {ml_names[j]}: Both models are most likely to have the same performance')

In [None]:
# catboost and lightgbm
# home model test without home/away teams
home_train_features = load_home_train_features_with_drop()
home_train_features.drop(columns=['HomeTeam', 'AwayTeam'], inplace=True)
home_train_target, _ = load_home_targets()
sc = StandardScaler(with_mean=False)

cb = CatBoostClassifier(n_estimators=200, learning_rate=0.1, random_seed=42)
lgbmc = LGBMClassifier(n_estimators=200, learning_rate=0.1, random_state=42)
xgb = XGBClassifier(n_estimators=200, learning_rate=0.1,
                    max_depth=3, use_label_encoder=False, random_state=42)

ml_models = [cb, lgbmc, xgb]
ml_names = ['CB', 'LGBMC', 'XGB']

model_list = []
score_list = []
std_list = []

for i in range(len(ml_models)):
    print(f'Running {ml_names[i]} Model')
    pipe = Pipeline(
        steps=[
            ('sc', sc),
            ('model', ml_models[i])
        ]
    )
    scores = evaluate_model(home_train_features, np.array(
        home_train_target).reshape(-1,), pipe, k_splits=10)
    model_list.append(ml_names[i])
    score_list.append(np.mean(scores))
    std_list.append(np.std(scores))
model_performance = pd.DataFrame(
    {'Model': model_list, 'Avg Fbeta': score_list, 'Std': std_list})
model_performance.sort_values(
    by='Avg Fbeta', ascending=False).reset_index(drop=True)

In [None]:
# LinearSVC, logistic regression and Gaussian NB with Nystroem (RBF) kernel approximation
home_train_features = load_home_train_features_with_drop_transformed()
home_train_features.drop(columns=['HomeTeam', 'AwayTeam'], inplace=True)
home_train_target, _ = load_home_targets()

rs = RobustScaler()
sc = StandardScaler(with_mean=False)
rbf = RBFSampler(gamma=2.0, random_state=1)
nys = Nystroem(random_state=1)

lin_svc = LinearSVC(random_state=42)
lr = LogisticRegression(
    solver='liblinear', class_weight='balanced', random_state=42)
gnb = GaussianNB()

models = [lin_svc, lr, gnb]
model_names = ['LSVC', 'LR', 'GNB']

model_list = []
score_list = []
std_list = []

for i in range(len(models)):
    pipe = Pipeline(
        steps=[
            ('sc', rs),
            ('nys', nys),
            ('model', models[i])
        ]
    )

    scores = evaluate_model(home_train_features, np.array(
        home_train_target).reshape(-1,), pipe, k_splits=10)
    model_list.append(model_names[i])
    score_list.append(np.mean(scores))
    std_list.append(np.std(scores))
model_performance = pd.DataFrame(
    {'Model': model_list, 'Avg Fbeta': score_list, 'Std': std_list})
model_performance.sort_values(
    by='Avg Fbeta', ascending=False).reset_index(drop=True)

In [None]:
# home model test without home/away teams and pca
home_train_features = load_home_train_features_with_drop()
home_train_features.drop(columns=['HomeTeam', 'AwayTeam'], inplace=True)
home_train_target, _ = load_home_targets()
sc = StandardScaler()
pca = PCA(n_components=30)

ml_models, ml_names = get_ml_models()

model_list = []
score_list = []
std_list = []

for i in range(len(ml_models)):
    pipe = Pipeline(
        steps=[
            ('sc', sc),
            ('pca', pca),
            ('model', ml_models[i])
        ]
    )
    scores = evaluate_model(home_train_features,
                            home_train_target, pipe, k_splits=10)
    model_list.append(ml_names[i])
    score_list.append(np.mean(scores))
    std_list.append(np.std(scores))
model_performance = pd.DataFrame(
    {'Model': model_list, 'Avg Fbeta': score_list, 'Std': std_list})
model_performance.sort_values(
    by='Avg Fbeta', ascending=False).reset_index(drop=True)

## Away Features

In [None]:
# away model test with home/away teams
away_train_features = load_away_train_features_with_drop()
away_train_features = football_data_team_ohe(away_train_features)
away_train_target, _ = load_away_targets()
sc = StandardScaler()

ml_models, ml_names = get_ml_models()

model_list = []
score_list = []
std_list = []

for i in range(len(ml_models)):
    pipe = Pipeline(
        steps=[
            ('sc', sc),
            ('model', ml_models[i])
        ]
    )
    scores = evaluate_model(away_train_features,
                            away_train_target, pipe, k_splits=10)
    model_list.append(ml_names[i])
    score_list.append(np.mean(scores))
    std_list.append(np.std(scores))
model_performance = pd.DataFrame(
    {'Model': model_list, 'Avg Fbeta': score_list, 'Std': std_list})
model_performance.sort_values(
    by='Avg Fbeta', ascending=False).reset_index(drop=True)

In [None]:
# away model test without home/away teams
away_train_features = load_away_train_features_with_drop()
away_train_features.drop(columns=['HomeTeam', 'AwayTeam'], inplace=True)
away_train_target, _ = load_away_targets()
sc = StandardScaler(with_mean=False)

ml_models, ml_names = get_ml_models()

model_list = []
score_list = []
std_list = []

for i in range(len(ml_models)):
    print(f'Running {ml_names[i]} Model')
    pipe = Pipeline(
        steps=[
            ('sc', sc),
            ('model', ml_models[i])
        ]
    )
    scores = evaluate_model(away_train_features, np.array(
        away_train_target).reshape(-1,), pipe, k_splits=10)
    model_list.append(ml_names[i])
    score_list.append(np.mean(scores))
    std_list.append(np.std(scores))
model_performance = pd.DataFrame(
    {'Model': model_list, 'Avg Fbeta': score_list, 'Std': std_list})
model_performance.sort_values(
    by='Avg Fbeta', ascending=False).reset_index(drop=True)

In [None]:
# hypothesis test for ml models with away data
away_train_features = load_away_train_features_with_drop()
away_train_features.drop(columns=['HomeTeam', 'AwayTeam'], inplace=True)
away_train_target, _ = load_away_targets()
sc = StandardScaler(with_mean=False)
sc.fit_transform(away_train_features)

fbeta_metric = make_scorer(fbeta)
mod_names = ['AB', 'GB', 'XGBRF', 'KNN']
ml_models, ml_names = get_ml_models()

for i in range(len(ml_models)):
    if ml_names[i] == 'XGB':
        for j in range(len(ml_models)):
            if ml_names[j] in mod_names:

                t, p = paired_ttest_5x2cv(
                    estimator1=ml_models[i],
                    estimator2=ml_models[j],
                    X=away_train_features,
                    y=np.array(away_train_target).reshape(-1,),
                    scoring=fbeta_metric,
                    random_seed=1
                )
                if p <= 0.05:
                    print(
                        f'{ml_names[i]} & {ml_names[j]}: Difference in model performance is most likely real')
                else:
                    print(
                        f'{ml_names[i]} & {ml_names[j]}: Both models are most likely to have the same performance')

In [None]:
# catboost and lightgbm
# away model test without home/away teams
away_train_features = load_away_train_features_with_drop()
away_train_features.drop(columns=['HomeTeam', 'AwayTeam'], inplace=True)
away_train_target, _ = load_away_targets()
sc = StandardScaler(with_mean=False)

cb = CatBoostClassifier(n_estimators=200, learning_rate=0.1, random_seed=42)
lgbmc = LGBMClassifier(n_estimators=200, learning_rate=0.1, random_state=42)
xgb = XGBClassifier(n_estimators=200, learning_rate=0.1,
                    max_depth=3, use_label_encoder=False, random_state=42)

ml_models = [cb, lgbmc, xgb]
ml_names = ['CB', 'LGBMC', 'XGB']

model_list = []
score_list = []
std_list = []

for i in range(len(ml_models)):
    print(f'Running {ml_names[i]} Model')
    pipe = Pipeline(
        steps=[
            ('sc', sc),
            ('model', ml_models[i])
        ]
    )
    scores = evaluate_model(away_train_features, np.array(
        away_train_target).reshape(-1,), pipe, k_splits=10)
    model_list.append(ml_names[i])
    score_list.append(np.mean(scores))
    std_list.append(np.std(scores))
model_performance = pd.DataFrame(
    {'Model': model_list, 'Avg Fbeta': score_list, 'Std': std_list})
model_performance.sort_values(
    by='Avg Fbeta', ascending=False).reset_index(drop=True)

In [None]:
# away model test without home/away teams and transformed continuous data
away_train_features = load_away_train_features_with_drop_transformed()
away_train_features.drop(columns=['HomeTeam', 'AwayTeam'], inplace=True)
away_train_target, _ = load_away_targets()
sc = StandardScaler(with_mean=False)

ml_models, ml_names = get_ml_models()

model_list = []
score_list = []
std_list = []

for i in range(len(ml_models)):
    print(f'Running {ml_names[i]} Model')
    pipe = Pipeline(
        steps=[
            ('sc', sc),
            ('model', ml_models[i])
        ]
    )
    scores = evaluate_model(away_train_features, np.array(
        away_train_target).reshape(-1,), pipe, k_splits=10)
    model_list.append(ml_names[i])
    score_list.append(np.mean(scores))
    std_list.append(np.std(scores))
model_performance = pd.DataFrame(
    {'Model': model_list, 'Avg Fbeta': score_list, 'Std': std_list})
model_performance.sort_values(
    by='Avg Fbeta', ascending=False).reset_index(drop=True)

In [None]:
# away model test without home/away teams and pca
away_train_features = load_away_train_features_with_drop()
away_train_features.drop(columns=['HomeTeam', 'AwayTeam'], inplace=True)
away_train_target, _ = load_away_targets()
sc = StandardScaler()
pca = PCA(n_components=30)

ml_models, ml_names = get_ml_models()

model_list = []
score_list = []
std_list = []

for i in range(len(ml_models)):
    pipe = Pipeline(
        steps=[
            ('sc', sc),
            ('pca', pca),
            ('model', ml_models[i])
        ]
    )
    scores = evaluate_model(away_train_features,
                            away_train_target, pipe, k_splits=10)
    model_list.append(ml_names[i])
    score_list.append(np.mean(scores))
    std_list.append(np.std(scores))
model_performance = pd.DataFrame(
    {'Model': model_list, 'Avg Fbeta': score_list, 'Std': std_list})
model_performance.sort_values(
    by='Avg Fbeta', ascending=False).reset_index(drop=True)

##  - Summary -

Removing the home and away team columns from the data resulted in a negligible loss of performance while improving computational speed. This matches what was seen during EDA, where the majority of teams individually provide very little information, with the exception of major teams like Manchester United and Chelsea.

As outlined, with this data numerical function algorithms should benefit from PCA being applied because of multicollinearity, and tree based algorithms should fare pretty well. A number of models were tested.

Home models to proceed with:

    Logistic Regression fbeta: 0.587389
    Gradient Boosting fbeta:   0.579037
    XGBoost fbeta:             0.578781
    Gaussian NB fbeta:         0.578502
    AdaBoost fbeta:            0.577135
    CatBoost fbeta:            0.577177
    
A paired t-test with 5x2 cross validation was used to check the statistical significance of each model's performance against the top performing model; every model's performance is statistically likely to be different from that of logistic regression. A Nystroem kernel approximation was applied to logistic regression and Gaussian NB to see if non-linear transforms of the data would increase performance, but it resulted in a negligible increase. Results with PCA were marginal, probably due to the initial number of features that were used for extraction, but PCA will be repeated after model feature selection. Surprisingly, logistic regression and Gaussian NB performed very well despite the multicollinearity.
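
For reference, the statistic behind a 5x2 cross validation paired t-test can be sketched by hand. This is a minimal illustration and not the `paired_ttest_5x2cv` implementation called in the cells above: the function names and the example score differences are hypothetical, and the p-value uses the closed-form Student t distribution for 5 degrees of freedom so that only the standard library is needed.

```python
import math


def two_sided_p_value_df5(t_stat):
    """Two-sided p-value for a Student t statistic with 5 degrees of freedom.

    Uses the closed-form CDF available for odd degrees of freedom.
    """
    theta = math.atan(abs(t_stat) / math.sqrt(5))
    s, c = math.sin(theta), math.cos(theta)
    prob_within = (2 / math.pi) * (theta + s * (c + (2 / 3) * c ** 3))
    return 1 - prob_within


def five_by_two_cv_t(diffs):
    """5x2cv paired t statistic from five (fold 1, fold 2) score differences.

    Each difference is score(model A) - score(model B) on one fold of one
    of the five 2-fold replications. Assumes the differences are not all
    identical within every replication (otherwise the variance is zero).
    """
    if len(diffs) != 5:
        raise ValueError('expected 5 replications of 2-fold cross validation')
    variances = []
    for d1, d2 in diffs:
        mean = (d1 + d2) / 2
        variances.append((d1 - mean) ** 2 + (d2 - mean) ** 2)
    # numerator: the difference from the first fold of the first replication
    t_stat = diffs[0][0] / math.sqrt(sum(variances) / 5)
    return t_stat, two_sided_p_value_df5(t_stat)


# hypothetical fbeta differences, not results from this notebook
example_diffs = [(0.012, 0.004), (0.009, 0.011), (0.003, 0.010),
                 (0.008, 0.006), (0.011, 0.002)]
t_stat, p_value = five_by_two_cv_t(example_diffs)
print(f't: {t_stat:.3f}, p: {p_value:.3f}')
```

A p-value at or below 0.05 is read in the cells above as a real difference in model performance; anything larger is read as the two models most likely performing the same.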

Away models to proceed with:

    AdaBoost fbeta:                0.643416
    Gradient Boosting fbeta:       0.639838
    XGBoost fbeta:                 0.645075
    Light Gradient Boosting fbeta: 0.645203
    CatBoost fbeta:                0.640387
    
A paired t-test with 5x2 cross validation was used to check the statistical significance of each model's performance against the top performing model; all top performing models are statistically likely to have the same performance as the best model. All away models were tested with the same procedures as the home models.
  
  
Both numerical function and tree based algorithms were tested on transformed and untransformed continuous data, as numerical function algorithms are susceptible to outliers and tree based algorithms are not. Going forward, numerical function algorithms will be tested on data whose continuous features have been transformed to make them more Gaussian-like, and a robust scaler will be used to limit the influence of any outliers that remain after the transforms. Tree based algorithms are unaffected by outliers, so they will be tested on untransformed continuous data with a standard scaler.
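
To illustrate why a robust scaler is preferred when outliers may remain, here is a small from-scratch sketch of the two scaling ideas. It mirrors the default behaviour of scikit-learn's `RobustScaler` (centre on the median, scale by the interquartile range) and `StandardScaler` (centre on the mean, scale by the population standard deviation), but the helper names and the toy values are hypothetical, and the notebook itself sometimes uses `with_mean=False`, which this sketch does not.

```python
import statistics


def robust_scale(values):
    """Centre on the median and divide by the interquartile range."""
    q1, median, q3 = statistics.quantiles(values, n=4, method='inclusive')
    iqr = q3 - q1
    return [(v - median) / iqr for v in values]


def standard_scale(values):
    """Centre on the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# toy feature with one extreme value (hypothetical data)
feature = [1, 2, 3, 4, 100]

print('robust  :', [round(v, 3) for v in robust_scale(feature)])
print('standard:', [round(v, 3) for v in standard_scale(feature)])
```

With the standard scaler the single extreme value inflates the mean and standard deviation, so the four ordinary values are squeezed into a very narrow range. The median and interquartile range ignore the extreme value, so the ordinary values keep their spread.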

As can be seen, all the top home models outperform the simple model baseline of 0.555328458299842, and all the top away models outperform the simple model baseline of 0.5644830306923153 and the majority class baseline of 0.6102523359914978. All models have gained skill, which is great.
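
As a reminder of what these baseline comparisons measure, the sketch below computes an F-beta score from labels and applies it to a naive "always predict a win" baseline. The notebook's own `fbeta` helper and its beta value are defined elsewhere, so `beta=0.5`, scoring on the positive class only, and the toy labels here are all assumptions made for illustration.

```python
def fbeta_from_labels(y_true, y_pred, beta):
    """F-beta score for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# hypothetical targets: 1 = win, 0 = lose/draw
y_true = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
always_win = [1] * len(y_true)

baseline = fbeta_from_labels(y_true, always_win, beta=0.5)
print(f'always-win baseline fbeta: {baseline:.4f}')
```

A model has gained skill only if its cross validated score is above what such a no-information strategy achieves on the same data.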

# -- Feature Selection --

## Home Feature selection

In [None]:
# new models to run.. lr, gnb, gb.... catboost

In [None]:
# rfe followed by sequential forward feature selection with grid search... gradient boosting
home_train_features = load_home_train_features_with_drop()
home_train_features.drop(columns=['HomeTeam', 'AwayTeam'], inplace=True)
home_train_target, _ = load_home_targets()
rfe_sc = StandardScaler(with_mean=False)
home_train_features_sc = rfe_sc.fit_transform(home_train_features)

print('Performing RFE')
rfe_model = GradientBoostingClassifier(
    n_estimators=200, max_features='sqrt', random_state=42)
rfe = RFE(rfe_model, n_features_to_select=25).fit(
    home_train_features_sc, np.array(home_train_target).reshape(-1,))
rfe_score_df = pd.DataFrame(
    {'Home Features': home_train_features.columns.tolist(), 'Ranking': rfe.ranking_})
# get top rfe features
msk = rfe_score_df['Ranking'] == 1
msk_idx = msk[msk].index.values
top_rfe_feats = rfe_score_df.iloc[msk_idx, 0].values
top_rfe_feats = home_train_features[top_rfe_feats]

print('Performing Sequential Forward Selection')
sfs_model = GradientBoostingClassifier(
    n_estimators=200, max_features='sqrt', random_state=42)
sfs_sc = StandardScaler(with_mean=False)
top_rfe_feats_sc = sfs_sc.fit_transform(top_rfe_feats)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=1)

fbeta_metric = make_scorer(fbeta)

sfs = SFS(
    estimator=sfs_model,
    k_features=20,
    forward=True,
    floating=False,
    scoring=fbeta_metric,
    cv=cv,
    n_jobs=-1
)
sfs = sfs.fit(top_rfe_feats_sc, np.array(
    home_train_target).reshape(-1,), custom_feature_names=top_rfe_feats.columns)


gb_sfs_df = pd.DataFrame.from_dict(sfs.get_metric_dict()).T
gb_sfs_df = gb_sfs_df.sort_values(
    by='avg_score', ascending=False).reset_index(drop=True)
t_fts = gb_sfs_df.iloc[0, :]['feature_names']
print(f'Num Features: {len(t_fts)}\nTop Features: {t_fts}')
gb_sfs_df
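
The sequential forward selection step used in these cells can be sketched as a simple greedy loop. This is an illustrative outline only and not the `SFS` class called above: the `forward_select` and `toy_score` names, the feature names and the integer scores are hypothetical stand-ins for a cross validated fbeta score.

```python
def forward_select(candidates, score_fn, k):
    """Greedily add the feature that gives the best score at each step.

    Returns a list of (selected_features, score) pairs, one per step.
    """
    selected = []
    remaining = list(candidates)
    history = []
    while remaining and len(selected) < k:
        # score every candidate subset formed by adding one more feature
        best_score, best_feat = max(
            (score_fn(selected + [feat]), feat) for feat in remaining)
        selected.append(best_feat)
        remaining.remove(best_feat)
        history.append((tuple(selected), best_score))
    return history


def toy_score(subset):
    """Hypothetical score: per-feature gains with a redundancy penalty."""
    gains = {'a': 30, 'b': 25, 'c': 5, 'noise': -2}
    score = sum(gains[feat] for feat in subset)
    if 'a' in subset and 'b' in subset:
        score -= 26  # 'a' and 'b' carry mostly the same information
    return score


history = forward_select(['a', 'b', 'c', 'noise'], toy_score, k=4)
# keep the best subset seen at any step, as the cell above does by sorting on avg_score
best_subset, best_score = max(history, key=lambda step: step[1])
print(best_subset, best_score)
```

Because the best subset is picked across all steps, the final feature set can be smaller than `k_features`, which is why the cells print the number of features in the top row.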

In [None]:
# select k best followed by sequential forward feature selection with grid search... gaussian nb
home_train_features = load_home_train_features_with_drop()
home_train_features.drop(columns=['HomeTeam', 'AwayTeam'], inplace=True)
home_train_target, _ = load_home_targets()
skb_sc = StandardScaler(with_mean=False)
home_train_features_sc = skb_sc.fit_transform(home_train_features)

print('Performing SkB')
# select k best, home features
skb = SelectKBest(score_func=mutual_info_classif, k='all').fit(
    home_train_features_sc, np.array(home_train_target).reshape(-1,))
feat_score_df = pd.DataFrame(
    {'Home Features': home_train_features.columns.tolist(), 'Mutual Info': skb.scores_})
feat_score_df = feat_score_df.sort_values(
    by='Mutual Info', ascending=False).reset_index(drop=True)
top_skb_feats = feat_score_df[0:25]['Home Features'].values
top_skb_feats = home_train_features[top_skb_feats]

print('Performing Sequential Forward Selection')
sfs_model = GaussianNB()
sfs_sc = StandardScaler(with_mean=False)
top_skb_feats_sc = sfs_sc.fit_transform(top_skb_feats)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=1)

fbeta_metric = make_scorer(fbeta)

sfs = SFS(
    estimator=sfs_model,
    k_features=20,
    forward=True,
    floating=False,
    scoring=fbeta_metric,
    cv=cv,
    n_jobs=-1
)
sfs = sfs.fit(top_skb_feats_sc, np.array(
    home_train_target).reshape(-1,), custom_feature_names=top_skb_feats.columns)

gnb_sfs_df = pd.DataFrame.from_dict(sfs.get_metric_dict()).T
gnb_sfs_df = gnb_sfs_df.sort_values(
    by='avg_score', ascending=False).reset_index(drop=True)
t_fts = gnb_sfs_df.iloc[0, :]['feature_names']
print(f'Num Features: {len(t_fts)}\nTop Features: {t_fts}')
gnb_sfs_df

In [None]:
# select k best followed by sequential forward feature selection with grid search...
# gaussian nb with transformed features and robust scaler
home_train_features = load_home_train_features_with_drop_transformed()
home_train_features.drop(columns=['HomeTeam', 'AwayTeam'], inplace=True)
home_train_target, _ = load_home_targets()
skb_rc = RobustScaler()
home_train_features_rc = skb_rc.fit_transform(home_train_features)

print('Performing SkB')
# select k best, home features
skb = SelectKBest(score_func=mutual_info_classif, k='all').fit(
    home_train_features_rc, np.array(home_train_target).reshape(-1,))
feat_score_df = pd.DataFrame(
    {'Home Features': home_train_features.columns.tolist(), 'Mutual Info': skb.scores_})
feat_score_df = feat_score_df.sort_values(
    by='Mutual Info', ascending=False).reset_index(drop=True)
top_skb_feats = feat_score_df[0:25]['Home Features'].values
top_skb_feats = home_train_features[top_skb_feats]

print('Performing Sequential Forward Selection')
sfs_model = GaussianNB()
sfs_rc = RobustScaler()
top_skb_feats_rc = sfs_rc.fit_transform(top_skb_feats)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=1)

fbeta_metric = make_scorer(fbeta)

sfs = SFS(
    estimator=sfs_model,
    k_features=20,
    forward=True,
    floating=False,
    scoring=fbeta_metric,
    cv=cv,
    n_jobs=-1
)
sfs = sfs.fit(top_skb_feats_rc, np.array(
    home_train_target).reshape(-1,), custom_feature_names=top_skb_feats.columns)

gnb_sfs_df = pd.DataFrame.from_dict(sfs.get_metric_dict()).T
gnb_sfs_df = gnb_sfs_df.sort_values(
    by='avg_score', ascending=False).reset_index(drop=True)
t_fts = gnb_sfs_df.iloc[0, :]['feature_names']
print(f'Num Features: {len(t_fts)}\nTop Features: {t_fts}')
gnb_sfs_df

In [None]:
# RFE selection followed by sequential forward feature selection with grid search... logistic regression
home_train_features = load_home_train_features_with_drop()
home_train_features.drop(columns=['HomeTeam', 'AwayTeam'], inplace=True)
home_train_target, _ = load_home_targets()
rfe_sc = StandardScaler(with_mean=False)
home_train_features_sc = rfe_sc.fit_transform(home_train_features)

print('Performing RFE')
rfe_model = LogisticRegression(
    solver='liblinear', class_weight='balanced', random_state=42)
rfe = RFE(rfe_model, n_features_to_select=25).fit(
    home_train_features_sc, np.array(home_train_target).reshape(-1,))
rfe_score_df = pd.DataFrame(
    {'Home Features': home_train_features.columns.tolist(), 'Ranking': rfe.ranking_})
# get top rfe features
msk = rfe_score_df['Ranking'] == 1
msk_idx = msk[msk].index.values
top_rfe_feats = rfe_score_df.iloc[msk_idx, 0].values
top_rfe_feats = home_train_features[top_rfe_feats]

print('Performing Sequential Forward Selection')
sfs_model = LogisticRegression(
    solver='liblinear', class_weight='balanced', random_state=42)
sfs_sc = StandardScaler(with_mean=False)
top_rfe_feats_sc = sfs_sc.fit_transform(top_rfe_feats)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=1)

fbeta_metric = make_scorer(fbeta)

sfs = SFS(
    estimator=sfs_model,
    k_features=20,
    forward=True,
    floating=False,
    scoring=fbeta_metric,
    cv=cv,
    n_jobs=-1
)
sfs = sfs.fit(top_rfe_feats_sc, np.array(
    home_train_target).reshape(-1,), custom_feature_names=top_rfe_feats.columns)

lr_sfs_df = pd.DataFrame.from_dict(sfs.get_metric_dict()).T
lr_sfs_df = lr_sfs_df.sort_values(
    by='avg_score', ascending=False).reset_index(drop=True)
t_fts = lr_sfs_df.iloc[0, :]['feature_names']
print(f'Num Features: {len(t_fts)}\nTop Features: {t_fts}')
lr_sfs_df

In [None]:
# RFE selection followed by sequential forward feature selection with grid search...
# logistic regression with transformed features and robust scaler
home_train_features = load_home_train_features_with_drop_transformed()
home_train_features.drop(columns=['HomeTeam', 'AwayTeam'], inplace=True)
home_train_target, _ = load_home_targets()
rfe_rc = RobustScaler()
home_train_features_rc = rfe_rc.fit_transform(home_train_features)

print('Performing RFE')
rfe_model = LogisticRegression(
    solver='liblinear', class_weight='balanced', random_state=42)
rfe = RFE(rfe_model, n_features_to_select=25).fit(
    home_train_features_rc, np.array(home_train_target).reshape(-1,))
rfe_score_df = pd.DataFrame(
    {'Home Features': home_train_features.columns.tolist(), 'Ranking': rfe.ranking_})
# get top rfe features
msk = rfe_score_df['Ranking'] == 1
msk_idx = msk[msk].index.values
top_rfe_feats = rfe_score_df.iloc[msk_idx, 0].values
top_rfe_feats = home_train_features[top_rfe_feats]

print('Performing Sequential Forward Selection')
sfs_model = LogisticRegression(
    solver='liblinear', class_weight='balanced', random_state=42)
sfs_rc = RobustScaler()
top_rfe_feats_rc = sfs_rc.fit_transform(top_rfe_feats)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=1)

fbeta_metric = make_scorer(fbeta)

sfs = SFS(
    estimator=sfs_model,
    k_features=20,
    forward=True,
    floating=False,
    scoring=fbeta_metric,
    cv=cv,
    n_jobs=-1
)
sfs = sfs.fit(top_rfe_feats_rc, np.array(
    home_train_target).reshape(-1,), custom_feature_names=top_rfe_feats.columns)

lr_sfs_df = pd.DataFrame.from_dict(sfs.get_metric_dict()).T
lr_sfs_df = lr_sfs_df.sort_values(
    by='avg_score', ascending=False).reset_index(drop=True)
t_fts = lr_sfs_df.iloc[0, :]['feature_names']
print(f'Num Features: {len(t_fts)}\nTop Features: {t_fts}')
lr_sfs_df

In [None]:
# RFE selection followed by sequential forward feature selection with grid search... xgboost
home_train_features = load_home_train_features_with_drop()
home_train_features.drop(columns=['HomeTeam', 'AwayTeam'], inplace=True)
home_train_target, _ = load_home_targets()
rfe_sc = StandardScaler(with_mean=False)
home_train_features_sc = rfe_sc.fit_transform(home_train_features)

print('Performing RFE')
rfe_model = XGBClassifier(n_estimators=200,
                          learning_rate=0.1,
                          max_depth=3,
                          min_child_weight=1,
                          gamma=0,
                          scale_pos_weight=1,
                          use_label_encoder=False,
                          n_jobs=-1,
                          verbosity=0,
                          random_state=42)
rfe = RFE(rfe_model, n_features_to_select=25).fit(
    home_train_features_sc, np.array(home_train_target).reshape(-1,))
rfe_score_df = pd.DataFrame(
    {'Home Features': home_train_features.columns.tolist(), 'Ranking': rfe.ranking_})
# get top rfe features
msk = rfe_score_df['Ranking'] == 1
msk_idx = msk[msk].index.values
top_rfe_feats = rfe_score_df.iloc[msk_idx, 0].values
top_rfe_feats = home_train_features[top_rfe_feats]

print('Performing Sequential Forward Selection')
sfs_model = XGBClassifier(n_estimators=200,
                          learning_rate=0.1,
                          max_depth=3,
                          min_child_weight=1,
                          gamma=0,
                          scale_pos_weight=1,
                          use_label_encoder=False,
                          n_jobs=-1,
                          verbosity=0,
                          random_state=42)
sfs_sc = StandardScaler(with_mean=False)
top_rfe_feats_sc = sfs_sc.fit_transform(top_rfe_feats)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=1)

fbeta_metric = make_scorer(fbeta)

sfs = SFS(
    estimator=sfs_model,
    k_features=20,
    forward=True,
    floating=False,
    scoring=fbeta_metric,
    cv=cv,
    n_jobs=-1
)
sfs = sfs.fit(top_rfe_feats_sc, np.array(
    home_train_target).reshape(-1,), custom_feature_names=top_rfe_feats.columns)

xgb_sfs_df = pd.DataFrame.from_dict(sfs.get_metric_dict()).T
xgb_sfs_df = xgb_sfs_df.sort_values(
    by='avg_score', ascending=False).reset_index(drop=True)
t_fts = xgb_sfs_df.iloc[0, :]['feature_names']
print(f'Num Features: {len(t_fts)}\nTop Features: {t_fts}')
xgb_sfs_df

In [None]:
# RFE selection followed by sequential forward feature selection with grid search... adaboost
home_train_features = load_home_train_features_with_drop()
home_train_features.drop(columns=['HomeTeam', 'AwayTeam'], inplace=True)
home_train_target, _ = load_home_targets()
rfe_sc = StandardScaler(with_mean=False)
home_train_features_sc = rfe_sc.fit_transform(home_train_features)

print('Performing RFE')
rfe_model = AdaBoostClassifier(n_estimators=100, random_state=42)
rfe = RFE(rfe_model, n_features_to_select=25).fit(
    home_train_features_sc, np.array(home_train_target).reshape(-1,))
rfe_score_df = pd.DataFrame(
    {'Home Features': home_train_features.columns.tolist(), 'Ranking': rfe.ranking_})
# get top rfe features
msk = rfe_score_df['Ranking'] == 1
msk_idx = msk[msk].index.values
top_rfe_feats = rfe_score_df.iloc[msk_idx, 0].values
top_rfe_feats = home_train_features[top_rfe_feats]

print('Performing Sequential Forward Selection')
sfs_model = AdaBoostClassifier(n_estimators=100, random_state=42)
sfs_sc = StandardScaler(with_mean=False)
top_rfe_feats_sc = sfs_sc.fit_transform(top_rfe_feats)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=1)

fbeta_metric = make_scorer(fbeta)

sfs = SFS(
    estimator=sfs_model,
    k_features=20,
    forward=True,
    floating=False,
    scoring=fbeta_metric,
    cv=cv,
    n_jobs=-1
)
sfs = sfs.fit(top_rfe_feats_sc, np.array(
    home_train_target).reshape(-1,), custom_feature_names=top_rfe_feats.columns)

ab_sfs_df = pd.DataFrame.from_dict(sfs.get_metric_dict()).T
ab_sfs_df = ab_sfs_df.sort_values(
    by='avg_score', ascending=False).reset_index(drop=True)
t_fts = ab_sfs_df.iloc[0, :]['feature_names']
print(f'\nNum Features: {len(t_fts)}\nTop Features: {t_fts}')
ab_sfs_df

In [None]:
# pca for logistic regression home features with transformed data and robust scaler
pca_home_df = load_home_train_features_with_drop_transformed()  # load features
pca_home_df = pca_home_df[['AHTGS5PG_UPoutlier', 'AHTGS5PHG_UPoutlier', 'AATGS5PAG_UPoutlier', 'AHTSOT5PG_UPoutlier',
                           'AATSOT5PG_UPoutlier', 'AHTSOT5PHG_UPoutlier', 'AHTGD5PHG_UPoutlier', 'HT_PrevSeasonPos_inv',
                           'AT_PrevSeasonPos_inv', 'HA_AHTGS5PG_diff_lowqrt', 'AHTGS5PG_upqrt_AATGC5PG_lowqrt',
                           'AHTSOT5PG_bxcx_pwrTRANSFORM', 'AATSOT5PG_bxcx_pwrTRANSFORM', 'AHTGD5PG_quantileTRANSFORM',
                           'AATGD5PG_quantileTRANSFORM', 'AwayCapacityDiff_bin_quantileTRANSFORM',
                           'bxcx_AHTSOT5PG_quantileTRANSFORM', 'bxcx_AATSOT5PG_quantileTRANSFORM',
                           'bxcx_AAT_GS_P5PG_ratio_quantileTRANSFORM']]

rc = RobustScaler()
pca_home_df = rc.fit_transform(pca_home_df)
# fit pca algorithm
pca = PCA(whiten=True).fit(pca_home_df)
# Plotting the Cumulative Summation of the Explained Variance
plt.figure(figsize=(10, 6))
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance Ratio')  # running total across components
plt.title('Explained Variance of Home Features')
plt.grid(alpha=0.3)
plt.show()

In [None]:
# pca for gaussian nb home features with transformed data and robust scaler
pca_home_df = load_home_train_features_with_drop_transformed()  # load features
pca_home_df = pca_home_df[['AwayCapacityDiff_bin_quantileTRANSFORM', 'HT_PrevSeasonPos_inv', 'HA_ATP5PG_diff_quantileTRANSFORM',
                           'AATSOT5PAG_bxcx_pwrTRANSFORM', 'season_month_sin', 'AHTGC5PG_UPoutlier', 'AT_PrevSeasonPos_inv',
                           'season_month']]

rc = RobustScaler()
pca_home_df = rc.fit_transform(pca_home_df)
# fit pca algorithm
pca = PCA(whiten=True).fit(pca_home_df)
# Plotting the Cumulative Summation of the Explained Variance
plt.figure(figsize=(10, 6))
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance Ratio')  # running total across components
plt.title('Explained Variance of Home Features')
plt.grid(alpha=0.3)
plt.show()
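
The plots above are read by eye to choose the number of components. As a hedged alternative, the same choice can be made programmatically from the explained variance ratios. The helper name, the 0.95 threshold and the example ratios below are hypothetical; in the notebook the ratios would come from `pca.explained_variance_ratio_`.

```python
from itertools import accumulate


def components_for_variance(ratios, threshold):
    """Smallest number of components whose cumulative ratio reaches threshold."""
    for k, cumulative in enumerate(accumulate(ratios), start=1):
        if cumulative >= threshold:
            return k
    return len(ratios)  # threshold never reached: keep every component


# hypothetical explained variance ratios, sorted as PCA returns them
example_ratios = [0.5, 0.25, 0.15, 0.07, 0.03]
print(components_for_variance(example_ratios, threshold=0.95))
```

The returned value could then be passed as `n_components` to the PCA step of a pipeline.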

In [None]:
# gaussian nb / logistic regression with pca reduced transformed features and robust scaler
home_train_features = load_home_train_features_with_drop_transformed()
lr_home_train_feats = home_train_features[['AHTGS5PG_UPoutlier', 'AHTGS5PHG_UPoutlier', 'AATGS5PAG_UPoutlier', 'AHTSOT5PG_UPoutlier',
                                           'AATSOT5PG_UPoutlier', 'AHTSOT5PHG_UPoutlier', 'AHTGD5PHG_UPoutlier', 'HT_PrevSeasonPos_inv',
                                           'AT_PrevSeasonPos_inv', 'HA_AHTGS5PG_diff_lowqrt', 'AHTGS5PG_upqrt_AATGC5PG_lowqrt',
                                           'AHTSOT5PG_bxcx_pwrTRANSFORM', 'AATSOT5PG_bxcx_pwrTRANSFORM', 'AHTGD5PG_quantileTRANSFORM',
                                           'AATGD5PG_quantileTRANSFORM', 'AwayCapacityDiff_bin_quantileTRANSFORM',
                                           'bxcx_AHTSOT5PG_quantileTRANSFORM', 'bxcx_AATSOT5PG_quantileTRANSFORM',
                                           'bxcx_AAT_GS_P5PG_ratio_quantileTRANSFORM']]
gnb_home_train_feats = home_train_features[['AwayCapacityDiff_bin_quantileTRANSFORM', 'HT_PrevSeasonPos_inv',
                                            'HA_ATP5PG_diff_quantileTRANSFORM', 'AATSOT5PAG_bxcx_pwrTRANSFORM',
                                            'season_month_sin', 'AHTGC5PG_UPoutlier', 'AT_PrevSeasonPos_inv', 'season_month']]
home_train_target, _ = load_home_targets()

gnb_model = GaussianNB()
lr_model = LogisticRegression(
    solver='liblinear', class_weight='balanced', random_state=42)

rc = RobustScaler()

lr_pipe = Pipeline(
    steps=[
        ('rc', rc),
        ('pca', PCA(n_components=12, whiten=True)),
        ('model', lr_model)
    ]
)

gnb_pipe = Pipeline(
    steps=[
        ('rc', rc),
        ('pca', PCA(n_components=7, whiten=True)),
        ('model', gnb_model)
    ]
)

lr_scores = evaluate_model(
    lr_home_train_feats, home_train_target, lr_pipe, k_splits=10)
gnb_scores = evaluate_model(
    gnb_home_train_feats, home_train_target, gnb_pipe, k_splits=10)

print(
    f'logistic Regression with PCA:  Avg Fbeta: {np.mean(lr_scores)}        Std: {np.std(lr_scores)}')
print(
    f'Gaussian NB with PCA:  Avg Fbeta: {np.mean(gnb_scores)}        Std: {np.std(gnb_scores)}')

In [None]:
# catboost feature importance
home_train_features = load_home_train_features_with_drop()
home_train_features.drop(columns=['HomeTeam', 'AwayTeam'], inplace=True)
home_train_target, _ = load_home_targets()
rfe_sc = StandardScaler(with_mean=False)
home_train_features_sc = rfe_sc.fit_transform(home_train_features)


cb = CatBoostClassifier(n_estimators=200,
                        learning_rate=0.1,
                        objective='Logloss',
                        bootstrap_type='Bayesian',
                        bagging_temperature=0.5,
                        max_depth=6,
                        verbose=3,
                        random_state=42)

pool = Pool(data=home_train_features_sc, label=home_train_target)

cb.fit(pool)
importances = cb.get_feature_importance(
    data=pool, fstr_type=EFstrType.FeatureImportance, verbose=2, prettified=False)
fi = pd.DataFrame({'Feature': home_train_features.columns,
                   'Importance': importances})
fi.sort_values(by='Importance', ascending=False).reset_index(drop=True)

In [None]:
cb_model = CatBoostClassifier(n_estimators=200,
                              learning_rate=0.1,
                              objective='Logloss',
                              bootstrap_type='Bayesian',
                              bagging_temperature=0.5,
                              max_depth=6,
                              verbose=3,
                              random_state=42)
home_train_features = load_home_train_features_with_drop()
home_train_target, _ = load_home_targets()
cb_feats = home_train_features[['HT_PrevSeasonPos_inv', 'AT_PrevSeasonPos_inv', 'AwayCapacityDiff_bin', 'AHTSOT5PHG', 'AATGD5PG',
                                'AATGS_SOT5PG_ratio', 'AATGD5PAG', 'AATGS_SOT5PAG_ratio', 'AATSOT5PAG', 'bxcx_AHTSOT5PG', 'AHTSOT5PG',
                                'bxcx_AAT_GS_P5PG_ratio', 'AHTGD5PHG', 'AHTGD5PG', 'AAT_GS_P5PG_ratio', 'AATP5PG', 'AATGC5PAG',
                                'AATSOT5PG']]

sc = StandardScaler(with_mean=True)

pipe = Pipeline(
    steps=[
        ('sc', sc),
        ('model', cb_model)
    ]
)
scores = evaluate_model(cb_feats, np.array(
    home_train_target).reshape(-1,), pipe, k_splits=10)
print(
    f'CB model top features fbeta score: {np.mean(scores)},    Std: {np.std(scores)}')

home features:

gradient boosting fbeta score: 0.584344, std: 0.011447

xgboost fbeta score: 0.584726, std: 0.00986

adaboost fbeta score: 0.581842, std: 0.009721

logistic regression fbeta: 0.592269, std: 0.009448.... logistic regression with pca fbeta score: 0.591103, std: 0.009015

gaussian nb fbeta score: 0.585609, std: 0.008748.... gaussian nb with pca fbeta score: 0.584452, std: 0.007734

catboost fbeta score: 0.578077,    std: 0.0090299

models to proceed with:     gradient boosting, xgboost, gaussian nb with pca, logistic regression with pca

## Away Feature Selection

In [None]:
# new models... xgb, gb, ab, knn... lightgbm, catboost, xgb with lr rate

In [None]:
# RFE selection followed by sequential forward feature selection with grid search... adaboost
away_train_features = load_away_train_features_with_drop()
away_train_features.drop(columns=['HomeTeam', 'AwayTeam'], inplace=True)
away_train_target, _ = load_away_targets()
rfe_sc = StandardScaler(with_mean=False)
away_train_features_sc = rfe_sc.fit_transform(away_train_features)

print('Performing RFE')
rfe_model = AdaBoostClassifier(n_estimators=100, random_state=42)
rfe = RFE(rfe_model, n_features_to_select=25).fit(
    away_train_features_sc, np.array(away_train_target).reshape(-1,))
rfe_score_df = pd.DataFrame(
    {'Away Features': away_train_features.columns.tolist(), 'Ranking': rfe.ranking_})
# get top rfe features
msk = rfe_score_df['Ranking'] == 1
msk_idx = msk[msk].index.values
top_rfe_feats = rfe_score_df.iloc[msk_idx, 0].values
top_rfe_feats = away_train_features[top_rfe_feats]

print('Performing Sequential Forward Selection')
sfs_model = AdaBoostClassifier(n_estimators=100, random_state=42)
sfs_sc = StandardScaler(with_mean=False)
top_rfe_feats_sc = sfs_sc.fit_transform(top_rfe_feats)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=1)

fbeta_metric = make_scorer(fbeta)

sfs = SFS(
    estimator=sfs_model,
    k_features=20,
    forward=True,
    floating=False,
    scoring=fbeta_metric,
    cv=cv,
    n_jobs=-1
)
sfs = sfs.fit(top_rfe_feats_sc, np.array(
    away_train_target).reshape(-1,), custom_feature_names=top_rfe_feats.columns)

ab_sfs_df = pd.DataFrame.from_dict(sfs.get_metric_dict()).T
ab_sfs_df = ab_sfs_df.sort_values(
    by='avg_score', ascending=False).reset_index(drop=True)
t_fts = ab_sfs_df.iloc[0, :]['feature_names']
print(f'\nTop Features: {t_fts}')
ab_sfs_df

In [None]:
# RFE selection followed by sequential forward feature selection with grid search... gradient boosting
away_train_features = load_away_train_features_with_drop()
away_train_features.drop(columns=['HomeTeam', 'AwayTeam'], inplace=True)
away_train_target, _ = load_away_targets()
rfe_sc = StandardScaler(with_mean=False)
away_train_features_sc = rfe_sc.fit_transform(away_train_features)

print('Performing RFE')
rfe_model = GradientBoostingClassifier(
    n_estimators=200, max_features='sqrt', random_state=42)
rfe = RFE(rfe_model, n_features_to_select=25).fit(
    away_train_features_sc, np.array(away_train_target).reshape(-1,))
rfe_score_df = pd.DataFrame(
    {'Away Features': away_train_features.columns.tolist(), 'Ranking': rfe.ranking_})
# get top rfe features
msk = rfe_score_df['Ranking'] == 1
msk_idx = msk[msk].index.values
top_rfe_feats = rfe_score_df.iloc[msk_idx, 0].values
top_rfe_feats = away_train_features[top_rfe_feats]

print('Performing Sequential Forward Selection')
sfs_model = GradientBoostingClassifier(
    n_estimators=200, max_features='sqrt', random_state=42)
sfs_sc = StandardScaler(with_mean=False)
top_rfe_feats_sc = sfs_sc.fit_transform(top_rfe_feats)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=1)

fbeta_metric = make_scorer(fbeta)

sfs = SFS(
    estimator=sfs_model,
    k_features=20,
    forward=True,
    floating=False,
    scoring=fbeta_metric,
    cv=cv,
    n_jobs=-1
)
sfs = sfs.fit(top_rfe_feats_sc, np.array(
    away_train_target).reshape(-1,), custom_feature_names=top_rfe_feats.columns)

gb_sfs_df = pd.DataFrame.from_dict(sfs.get_metric_dict()).T
gb_sfs_df = gb_sfs_df.sort_values(
    by='avg_score', ascending=False).reset_index(drop=True)
t_fts = gb_sfs_df.iloc[0, :]['feature_names']
print(f'\nTop Features: {t_fts}')
gb_sfs_df

In [None]:
# RFE selection followed by sequential forward feature selection with grid search... xgboost
away_train_features = load_away_train_features_with_drop()
away_train_features.drop(columns=['HomeTeam', 'AwayTeam'], inplace=True)
away_train_target, _ = load_away_targets()
rfe_sc = StandardScaler(with_mean=False)
away_train_features_sc = rfe_sc.fit_transform(away_train_features)

print('Performing RFE')
rfe_model = XGBClassifier(n_estimators=200,
                          learning_rate=0.1,
                          max_depth=3,
                          min_child_weight=1,
                          gamma=0,
                          scale_pos_weight=1,
                          use_label_encoder=False,
                          n_jobs=-1,
                          verbosity=0,
                          random_state=42)
rfe = RFE(rfe_model, n_features_to_select=25).fit(
    away_train_features_sc, np.array(away_train_target).reshape(-1,))
rfe_score_df = pd.DataFrame(
    {'Away Features': away_train_features.columns.tolist(), 'Ranking': rfe.ranking_})
# get top rfe features
msk = rfe_score_df['Ranking'] == 1
msk_idx = msk[msk].index.values
top_rfe_feats = rfe_score_df.iloc[msk_idx, 0].values
top_rfe_feats = away_train_features[top_rfe_feats]

print('Performing Sequential Forward Selection')
sfs_model = XGBClassifier(n_estimators=200,
                          learning_rate=0.1,
                          max_depth=3,
                          min_child_weight=1,
                          gamma=0,
                          scale_pos_weight=1,
                          use_label_encoder=False,
                          n_jobs=-1,
                          verbosity=0,
                          random_state=42)
sfs_sc = StandardScaler(with_mean=False)
top_rfe_feats_sc = sfs_sc.fit_transform(top_rfe_feats)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=1)

fbeta_metric = make_scorer(fbeta)

sfs = SFS(
    estimator=sfs_model,
    k_features=20,
    forward=True,
    floating=False,
    scoring=fbeta_metric,
    cv=cv,
    n_jobs=-1
)
sfs = sfs.fit(top_rfe_feats_sc, np.array(
    away_train_target).reshape(-1,), custom_feature_names=top_rfe_feats.columns)

xgb_sfs_df = pd.DataFrame.from_dict(sfs.get_metric_dict()).T
xgb_sfs_df = xgb_sfs_df.sort_values(
    by='avg_score', ascending=False).reset_index(drop=True)
t_fts = xgb_sfs_df.iloc[0, :]['feature_names']
print(f'\nTop Features: {t_fts}')
xgb_sfs_df

In [None]:
# RFE selection followed by sequential forward feature selection with grid search... lightgbm
away_train_features = load_away_train_features_with_drop()
away_train_features.drop(columns=['HomeTeam', 'AwayTeam'], inplace=True)
away_train_target, _ = load_away_targets()
#cat_feat = [(idx, col) for idx, col in enumerate(away_train_features.columns) if 'UPoutlier' in col or 'LOWoutlier' in col or 'Local_Derby' in col or 'Dist>=100' in col or 'cluster' in col or 'upqrt' in col or 'lowqrt' in col or 'UPLOW' in col or 'bigcapacitydiff_' in col]
# cat_feat = np.array([col[0] for col in cat_feat]) # categorical features
rfe_sc = StandardScaler(with_mean=False)
away_train_features_sc = rfe_sc.fit_transform(away_train_features)

print('Performing RFE')
rfe_model = LGBMClassifier(n_estimators=200,
                           learning_rate=0.1,
                           objective='binary',
                           max_depth=3,
                           n_jobs=-1,
                           random_state=42)
rfe = RFE(rfe_model, n_features_to_select=25).fit(
    away_train_features_sc, np.array(away_train_target).reshape(-1,))
rfe_score_df = pd.DataFrame(
    {'Away Features': away_train_features.columns.tolist(), 'Ranking': rfe.ranking_})
# get top rfe features
msk = rfe_score_df['Ranking'] == 1
msk_idx = msk[msk].index.values
top_rfe_feats = rfe_score_df.iloc[msk_idx, 0].values
top_rfe_feats = away_train_features[top_rfe_feats]

print('Performing Sequential Forward Selection')
sfs_model = LGBMClassifier(n_estimators=200,
                           learning_rate=0.1,
                           objective='binary',
                           max_depth=3,
                           n_jobs=-1,
                           random_state=42)
sfs_sc = StandardScaler(with_mean=False)
top_rfe_feats_sc = sfs_sc.fit_transform(top_rfe_feats)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=1)

fbeta_metric = make_scorer(fbeta)

sfs = SFS(
    estimator=sfs_model,
    k_features=20,
    forward=True,
    floating=False,
    scoring=fbeta_metric,
    cv=cv,
    n_jobs=-1
)
sfs = sfs.fit(top_rfe_feats_sc, np.array(
    away_train_target).reshape(-1,), custom_feature_names=top_rfe_feats.columns)

lgbm_sfs_df = pd.DataFrame.from_dict(sfs.get_metric_dict()).T
lgbm_sfs_df = lgbm_sfs_df.sort_values(
    by='avg_score', ascending=False).reset_index(drop=True)
t_fts = lgbm_sfs_df.iloc[0, :]['feature_names']
print(f'\nTop Features: {t_fts}')
lgbm_sfs_df

In [None]:
# catboost feature importance
away_train_features = load_away_train_features_with_drop()
away_train_features.drop(columns=['HomeTeam', 'AwayTeam'], inplace=True)
away_train_target, _ = load_away_targets()
rfe_sc = StandardScaler(with_mean=False)
away_train_features_sc = rfe_sc.fit_transform(away_train_features)


cb = CatBoostClassifier(n_estimators=200,
                        learning_rate=0.1,
                        objective='Logloss',
                        bootstrap_type='Bayesian',
                        bagging_temperature=0.5,
                        max_depth=6,
                        verbose=3,
                        random_state=42)

pool = Pool(data=away_train_features_sc, label=away_train_target)

cb.fit(pool)
importances = cb.get_feature_importance(
    data=pool, fstr_type=EFstrType.FeatureImportance, verbose=2, prettified=False)
fi = pd.DataFrame({'Feature': away_train_features.columns,
                   'Importance': importances})
fi.sort_values(by='Importance', ascending=False).reset_index(drop=True)

In [None]:
cb_model = CatBoostClassifier(n_estimators=200,
                              learning_rate=0.1,
                              objective='Logloss',
                              bootstrap_type='Bayesian',
                              bagging_temperature=0.5,
                              max_depth=6,
                              verbose=3,
                              random_state=42)
away_train_features = load_away_train_features_with_drop()
away_train_target, _ = load_away_targets()
cb_feats = away_train_features[['HT_PrevSeasonPos_inv', 'AT_PrevSeasonPos_inv', 'AwayCapacityDiff_bin', 'AHTSOT5PHG', 'AATSOT5PAG', 'AATGS_SOT5PAG_ratio',
                                'HA_ATP5PG_diff', 'year', 'AHTSOT5PG', 'AATGD5PAG', 'bxcx_AATSOT5PG', 'HA_AHTGS5PG_diff', 'AHTP5PHG', 'AATGS_SOT5PG_ratio',
                                'AAT_GS_P5PG_ratio', 'AHTGC5PHG', 'AHTGD5PG', 'AHT_GS_P5PG_ratio']]

sc = StandardScaler(with_mean=True)

pipe = Pipeline(
    steps=[
        ('sc', sc),
        ('model', cb_model)
    ]
)
scores = evaluate_model(cb_feats, np.array(
    away_train_target).reshape(-1,), pipe, k_splits=10)
print(
    f'CB model top features fbeta score: {np.mean(scores)},    Std: {np.std(scores)}')

Away features:

adaboost fbeta score: 0.643711, std: 0.005774

gradient boosting fbeta score: 0.643802, std: 0.004382

xgboost fbeta score: 0.644352, std: 0.003056

lightgbm fbeta score: 0.643192, std: 0.004525

catboost fbeta score: 0.639942,    std: 0.004579

models to proceed with:     xgboost, lightgbm, gradient boosting and adaboost

## - Summary -

Given the number of features, running sequential forward feature selection on the full feature set would be computationally inefficient, taking hours to search the feature space for the optimal subset. The next best approach is to whittle the feature space down first and then run sequential forward feature selection on the best remaining features. For the first stage, recursive feature elimination is used with the models that expose coefficients or feature importances, and a select-k-best approach is used for those that do not. In both cases the original dataset is reduced to the best 25 features, which are then passed to sequential forward feature selection to obtain the optimal subset.
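
The two-stage idea can be sketched on synthetic data as follows. This is a minimal illustration only: it uses scikit-learn's `SequentialFeatureSelector` rather than the selector used in the cells above, and the synthetic dataset, the `f_classif` score function, the `f1` scoring and the final subset size of 5 are all assumptions chosen to keep the example quick to run.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import (SelectKBest, SequentialFeatureSelector,
                                       f_classif)
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the match features (assumption: not the project data)
X, y = make_classification(n_samples=400, n_features=40, n_informative=6,
                           random_state=42)

# Stage 1: cheap univariate filter down to the 25 highest-scoring features
skb = SelectKBest(score_func=f_classif, k=25).fit(X, y)
stage1_idx = np.flatnonzero(skb.get_support())
X_stage1 = X[:, stage1_idx]

# Stage 2: forward selection, searching only the reduced feature set
sfs = SequentialFeatureSelector(GaussianNB(), n_features_to_select=5,
                                direction='forward', scoring='f1', cv=3)
sfs.fit(X_stage1, y)

# Map the stage 2 mask back to column positions in the original matrix
selected_idx = stage1_idx[sfs.get_support()]
print(f'Stage 1 kept {len(stage1_idx)} features, '
      f'stage 2 kept {len(selected_idx)}: {selected_idx}')
```

Because stage 2 only ever evaluates candidates from the 25 survivors, the number of cross-validated fits grows with 25 rather than with the full feature count, which is where the time saving comes from.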

Home model outcomes:

    Gradient Boosting... No. features: 13,    avg fbeta: 0.584344
    Gaussian NB... No. features: 8,    avg fbeta: 0.585609,    after pca... No. components: 7,   avg fbeta: 0.584452
    Logistic Regression... No. features: 19,    avg fbeta: 0.592269,    after pca... No. components: 12,    avg fbeta: 0.591103
    XGBoost... No. features: 12,    avg fbeta: 0.584726
    Adaboost... No. features: 18,    avg fbeta: 0.581842
    CatBoost... No. features: 18,    avg fbeta: 0.578077
  
After applying PCA for the logistic regression and Gaussian NB models, the number of features was reduced while retaining nearly all of the variance of the original data, with only a slight decrease in performance. The PCA-reduced versions are kept despite that minimal decrease because the transformed data is better set up for these models and should contribute to better generalisation. After PCA the logistic regression data was reduced to 12 components and the Gaussian NB data to 7 components.
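
As a rough illustration of reducing features while keeping nearly all of the variance, the sketch below asks PCA for the smallest number of components that explains at least 99% of it. The 0.99 threshold, the synthetic data and the use of `RobustScaler` before PCA are assumptions for the example, not the settings used for the models above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import RobustScaler

# Synthetic data with deliberately redundant (linearly dependent) features
X, _ = make_classification(n_samples=500, n_features=19, n_informative=5,
                           n_redundant=10, random_state=42)

# Scale first so that no single feature dominates the variance
X_scaled = RobustScaler().fit_transform(X)

# A float n_components keeps the fewest components reaching that variance share
pca = PCA(n_components=0.99, svd_solver='full')
X_reduced = pca.fit_transform(X_scaled)

retained = pca.explained_variance_ratio_.sum()
print(f'{X.shape[1]} features -> {X_reduced.shape[1]} components, '
      f'variance retained: {retained:.4f}')
```

When comparing models, the PCA step would normally sit inside the cross-validation pipeline so that it is fitted on the training folds only.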

home models to proceed with:    gradient boosting, xgboost, gaussian nb with pca, logistic regression with pca

Away model outcomes:

    Adaboost... No. features: 20,    avg fbeta: 0.643711
    Gradient Boosting... No. features: 13,    avg fbeta: 0.643802
    XGBoost... No. features: 9,    avg fbeta: 0.644352
    Light Gradient Boosting... No. features: 12,    avg fbeta: 0.643192
    CatBoost... No. features: 18,    avg fbeta: 0.639942
    
away models to proceed with:    xgboost, lightgbm, gradient boosting and adaboost

As can be seen, after feature selection all models have either improved or kept approximately the same performance as in the initial model selection, which is a good result: a reduced dataset with negligible performance loss.

# -- Sampling --

## - Undersampling -

### Home Features

In [None]:
# gradient boosting, logistic regression/gaussian nb with pca
# get models and features
gb_model, gb_feats, gb_name = home_gb_setup()
lr_model, lr_feats, lr_pca, lr_name = home_lr_setup()
gnb_model, gnb_feats, gnb_pca, gnb_name = home_gnb_setup()
models = [gb_model, lr_model, gnb_model]
mod_names = [gb_name, lr_name, gnb_name]
feats = [gb_feats, lr_feats, gnb_feats]
home_train_target, _ = load_home_targets()

# get undersampling models
us_models, us_names = get_us_models()

sc = StandardScaler(with_mean=False)
rs = RobustScaler()

for i in range(len(models)):
    print(f'\nStarting: {mod_names[i]}')
    us_scores = []
    us_model_list = []
    mean_scores = []
    std_list = []
    for j in range(len(us_models)):
        if mod_names[i] == 'GB':
            pipe = IMBPipeline(
                steps=[
                    ('sc', sc),
                    ('us', us_models[j]),
                    ('model', models[i])
                ]
            )
        elif mod_names[i] == 'LR':
            pipe = IMBPipeline(
                steps=[
                    ('rs', rs),
                    ('us', us_models[j]),
                    ('pca', lr_pca),
                    ('model', models[i])
                ]
            )
        else:
            pipe = IMBPipeline(
                steps=[
                    ('rs', rs),
                    ('us', us_models[j]),
                    ('pca', gnb_pca),
                    ('model', models[i])
                ]
            )

        scores = evaluate_model(feats[i], home_train_target, pipe, k_splits=10)
        us_scores.append(scores)
        us_model_list.append(us_models[j])
        mean_scores.append(np.mean(scores))
        std_list.append(np.std(scores))

    us_df = pd.DataFrame(
        {
            'US Model': us_model_list,
            'Fbeta Score': mean_scores,
            'Std': std_list
        }
    )
    us_df = us_df.sort_values(
        by='Fbeta Score', ascending=False).reset_index(drop=True)
    print(us_df)
    plt.boxplot(us_scores, labels=us_names, showmeans=True)
    plt.title(f'{mod_names[i]} Undersampling')
    plt.grid(alpha=0.3)
    plt.show()

In [None]:
# xgboost with params
xgb_model, xgb_feats, xgb_name = home_xgb_setup()
home_train_target, _ = load_home_targets()

# get undersampling models
us_models, us_names = get_us_models()

sc = StandardScaler(with_mean=False)

us_scores = []
us_model_list = []
mean_scores = []
std_list = []
for i in range(len(us_models)):
    pipe = IMBPipeline(
        steps=[
            ('sc', sc),
            ('us', us_models[i]),
            ('model', xgb_model)
        ]
    )

    scores = evaluate_model(xgb_feats, home_train_target, pipe, k_splits=10)
    us_scores.append(scores)
    us_model_list.append(us_models[i])
    mean_scores.append(np.mean(scores))
    std_list.append(np.std(scores))

us_df = pd.DataFrame(
    {
        'US Model': us_model_list,
        'Fbeta Score': mean_scores,
        'Std': std_list
    }
)
us_df = us_df.sort_values(
    by='Fbeta Score', ascending=False).reset_index(drop=True)
print(us_df)
plt.boxplot(us_scores, labels=us_names, showmeans=True)
plt.title(f'{xgb_name} Undersampling')
plt.grid(alpha=0.3)
plt.show()

### Away Features

In [None]:
# xgboost, gradient boosting and adaboost undersampling
xgb_model, xgb_feats, xgb_name = away_xgb_setup()  # get xgboost setup
gb_model, gb_feats, gb_name = away_gb_setup()
ab_model, ab_feats, ab_name = away_ab_setup()
away_train_target, _ = load_away_targets()  # load target

models = [xgb_model, gb_model, ab_model]
mod_names = [xgb_name, gb_name, ab_name]
feats = [xgb_feats, gb_feats, ab_feats]

us_models, us_names = get_us_models()  # get undersampling models

sc = StandardScaler(with_mean=False)  # initiate scaler

# loop through undersampling models with pipeline
for i in range(len(models)):
    print(f'\nStarting: {mod_names[i]}')
    us_scores = []
    us_model_list = []
    mean_scores = []
    std_list = []
    for j in range(len(us_models)):
        pipe = IMBPipeline(
            steps=[
                ('sc', sc),
                ('us', us_models[j]),
                ('mod', models[i])
            ]
        )

        # get undersampling model scores
        scores = evaluate_model(feats[i], away_train_target, pipe, k_splits=10)
        us_scores.append(scores)  # append lists
        us_model_list.append(us_models[j])
        mean_scores.append(np.mean(scores))
        std_list.append(np.std(scores))
    # df of undersampling model scores
    us_df = pd.DataFrame(
        {
            'US Model': us_model_list,
            'Fbeta Score': mean_scores,
            'Std': std_list
        }
    )
    us_df = us_df.sort_values(
        by='Fbeta Score', ascending=False).reset_index(drop=True)
    print(us_df)
    plt.boxplot(us_scores, labels=us_names, showmeans=True)
    plt.title(f'{mod_names[i]} Undersampling')
    plt.grid(alpha=0.3)
    plt.show()

In [None]:
# xgboost with params and lightgbm
xgb_model, xgb_feats, xgb_name = away_xgb_setup()  # get xgboost setup
lgbm_model, lgbm_feats, lgbm_name = away_lgbm_setup()

away_train_target, _ = load_away_targets()  # load target

models = [xgb_model, lgbm_model]
mod_names = [xgb_name, lgbm_name]
feats = [xgb_feats, lgbm_feats]

us_models, us_names = get_us_models()  # get undersampling models

sc = StandardScaler(with_mean=False)  # initiate scaler

# loop through undersampling models with pipeline
for i in range(len(models)):
    print(f'\nStarting: {mod_names[i]}')
    us_scores = []
    us_model_list = []
    mean_scores = []
    std_list = []
    for j in range(len(us_models)):
        pipe = IMBPipeline(
            steps=[
                ('sc', sc),
                ('us', us_models[j]),
                ('mod', models[i])
            ]
        )

        # get undersampling model scores
        scores = evaluate_model(feats[i], away_train_target, pipe, k_splits=10)
        us_scores.append(scores)  # append lists
        us_model_list.append(us_models[j])
        mean_scores.append(np.mean(scores))
        std_list.append(np.std(scores))
    # df of undersampling model scores
    us_df = pd.DataFrame(
        {
            'US Model': us_model_list,
            'Fbeta Score': mean_scores,
            'Std': std_list
        }
    )
    us_df = us_df.sort_values(
        by='Fbeta Score', ascending=False).reset_index(drop=True)
    print(us_df)
    plt.boxplot(us_scores, labels=us_names, showmeans=True)
    plt.title(f'{mod_names[i]} Undersampling')
    plt.grid(alpha=0.3)
    plt.show()

## - Oversampling -

### Home Features

In [None]:
# gradient boosting, logistic regression/gaussian nb with pca
# get models and features
gb_model, gb_feats, gb_name = home_gb_setup()
lr_model, lr_feats, lr_pca, lr_name = home_lr_setup()
gnb_model, gnb_feats, gnb_pca, gnb_name = home_gnb_setup()
models = [gb_model, lr_model, gnb_model]
mod_names = [gb_name, lr_name, gnb_name]
feats = [gb_feats, lr_feats, gnb_feats]
home_train_target, _ = load_home_targets()

# get oversampling models
os_models, os_names = get_os_models()

sc = StandardScaler(with_mean=False)
rs = RobustScaler()  # used by the LR and GNB pipelines below

for i in range(len(models)):
    print(f'Starting: {mod_names[i]}')
    os_scores = []
    os_model_list = []
    mean_scores = []
    std_list = []
    for j in range(len(os_models)):
        if mod_names[i] == 'GB':
            pipe = IMBPipeline(
                steps=[
                    ('sc', sc),
                    ('os', os_models[j]),
                    ('model', models[i])
                ]
            )
        elif mod_names[i] == 'LR':
            pipe = IMBPipeline(
                steps=[
                    ('rs', rs),
                    ('os', os_models[j]),
                    ('pca', lr_pca),
                    ('model', models[i])
                ]
            )
        else:
            pipe = IMBPipeline(
                steps=[
                    ('rs', rs),
                    ('os', os_models[j]),
                    ('pca', gnb_pca),
                    ('model', models[i])
                ]
            )
        scores = evaluate_model(feats[i], home_train_target, pipe, k_splits=10)
        os_scores.append(scores)
        os_model_list.append(os_models[j])
        mean_scores.append(np.mean(scores))
        std_list.append(np.std(scores))

    os_df = pd.DataFrame(
        {
            'OS Model': os_model_list,
            'Fbeta Score': mean_scores,
            'Std': std_list
        }
    )
    os_df = os_df.sort_values(
        by='Fbeta Score', ascending=False).reset_index(drop=True)
    print(os_df)
    plt.boxplot(os_scores, labels=os_names, showmeans=True)
    plt.title(f'{mod_names[i]} Oversampling')
    plt.grid(alpha=0.3)
    plt.show()

In [None]:
# xgboost with params
xgb_model, xgb_feats, xgb_name = home_xgb_setup()
home_train_target, _ = load_home_targets()

# get oversampling models
os_models, os_names = get_os_models()

sc = StandardScaler(with_mean=False)

os_scores = []
os_model_list = []
mean_scores = []
std_list = []
for i in range(len(os_models)):
    pipe = IMBPipeline(
        steps=[
            ('sc', sc),
            ('os', os_models[i]),
            ('model', xgb_model)
        ]
    )

    scores = evaluate_model(xgb_feats, home_train_target, pipe, k_splits=10)
    os_scores.append(scores)
    os_model_list.append(os_models[i])
    mean_scores.append(np.mean(scores))
    std_list.append(np.std(scores))

os_df = pd.DataFrame(
    {
        'OS Model': os_model_list,
        'Fbeta Score': mean_scores,
        'Std': std_list
    }
)
os_df = os_df.sort_values(
    by='Fbeta Score', ascending=False).reset_index(drop=True)
print(os_df)
plt.boxplot(os_scores, labels=os_names, showmeans=True)
plt.title(f'{xgb_name} Oversampling')
plt.grid(alpha=0.3)
plt.show()

### Away Features

In [None]:
# xgboost, gradient boosting and adaboost oversampling
xgb_model, xgb_feats, xgb_name = away_xgb_setup()  # get xgboost setup
gb_model, gb_feats, gb_name = away_gb_setup()
ab_model, ab_feats, ab_name = away_ab_setup()
away_train_target, _ = load_away_targets()  # load target

models = [xgb_model, gb_model, ab_model]
mod_names = [xgb_name, gb_name, ab_name]
feats = [xgb_feats, gb_feats, ab_feats]

os_models, os_names = get_os_models()  # get oversampling models

sc = StandardScaler(with_mean=False)  # initiate scaler

# loop through oversampling models with pipeline
for i in range(len(models)):
    print(f'\nStarting: {mod_names[i]}')
    os_scores = []
    os_model_list = []
    mean_scores = []
    std_list = []
    for j in range(len(os_models)):
        pipe = IMBPipeline(
            steps=[
                ('sc', sc),
                ('os', os_models[j]),
                ('mod', models[i])
            ]
        )

        # get oversampling model scores
        scores = evaluate_model(feats[i], away_train_target, pipe, k_splits=10)
        os_scores.append(scores)  # append lists
        os_model_list.append(os_models[j])
        mean_scores.append(np.mean(scores))
        std_list.append(np.std(scores))
    # df of oversampling model scores
    os_df = pd.DataFrame(
        {
            'OS Model': os_model_list,
            'Fbeta Score': mean_scores,
            'Std': std_list
        }
    )
    os_df = os_df.sort_values(
        by='Fbeta Score', ascending=False).reset_index(drop=True)
    print(os_df)
    plt.boxplot(os_scores, labels=os_names, showmeans=True)
    plt.title(f'{mod_names[i]} Oversampling')
    plt.grid(alpha=0.3)
    plt.show()

In [None]:
# xgboost with params and lightgbm
xgb_model, xgb_feats, xgb_name = away_xgb_setup()  # get xgboost setup
lgbm_model, lgbm_feats, lgbm_name = away_lgbm_setup()

away_train_target, _ = load_away_targets()  # load target

models = [xgb_model, lgbm_model]
mod_names = [xgb_name, lgbm_name]
feats = [xgb_feats, lgbm_feats]

os_models, os_names = get_os_models()  # get oversampling models

sc = StandardScaler(with_mean=False)  # initiate scaler

# loop through oversampling models with pipeline
for i in range(len(models)):
    print(f'\nStarting: {mod_names[i]}')
    os_scores = []
    os_model_list = []
    mean_scores = []
    std_list = []
    for j in range(len(os_models)):
        pipe = IMBPipeline(
            steps=[
                ('sc', sc),
                ('os', os_models[j]),
                ('mod', models[i])
            ]
        )

        # get oversampling model scores
        scores = evaluate_model(feats[i], away_train_target, pipe, k_splits=10)
        os_scores.append(scores)  # append lists
        os_model_list.append(os_models[j])
        mean_scores.append(np.mean(scores))
        std_list.append(np.std(scores))
    # df of oversampling model scores
    os_df = pd.DataFrame(
        {
            'OS Model': os_model_list,
            'Fbeta Score': mean_scores,
            'Std': std_list
        }
    )
    os_df = os_df.sort_values(
        by='Fbeta Score', ascending=False).reset_index(drop=True)
    print(os_df)
    plt.boxplot(os_scores, labels=os_names, showmeans=True)
    plt.title(f'{mod_names[i]} Oversampling')
    plt.grid(alpha=0.3)
    plt.show()

## - Summary -

Undersampling and oversampling are used to balance the dataset for classification, so that the models see enough of the minority class to learn its characteristics. Without balancing, a model tends to be good at predicting the majority class and, depending on how severe the imbalance is, next to useless at predicting the minority class. To mitigate this, undersampling, oversampling or a combination of both can be applied: at a high level, undersampling removes examples of the majority class and oversampling creates more examples of the minority class. Each method uses some criterion to decide which examples to sample. For example, some oversampling methods resample minority examples near the borderline between the two classes, since these are the ones most easily misclassified, and sampling them can help the model predict the correct class.
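
The simplest form of each idea can be sketched with NumPy alone. This is only an illustration on made-up data with a 70/30 class split: it samples at random, whereas the methods compared above (Tomek links, near miss, SMOTE variants and so on) choose examples by a criterion or synthesise new ones.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up imbalanced data: 70 majority (class 0) and 30 minority (class 1) rows
y = np.array([0] * 70 + [1] * 30)
X = rng.normal(size=(100, 3))

maj_idx = np.flatnonzero(y == 0)
min_idx = np.flatnonzero(y == 1)

# Random undersampling: keep only as many majority rows as there are minority rows
keep_maj = rng.choice(maj_idx, size=len(min_idx), replace=False)
us_idx = np.concatenate([keep_maj, min_idx])
X_us, y_us = X[us_idx], y[us_idx]

# Random oversampling: duplicate minority rows until the classes match
extra_min = rng.choice(min_idx, size=len(maj_idx) - len(min_idx), replace=True)
os_idx = np.concatenate([maj_idx, min_idx, extra_min])
X_os, y_os = X[os_idx], y[os_idx]

print('undersampled class counts:', np.bincount(y_us))  # [30 30]
print('oversampled class counts: ', np.bincount(y_os))  # [70 70]
```

Resampling should only ever touch the training data. In the cells above this is handled by placing the sampler inside an imbalanced-learn pipeline, which applies samplers during fitting and not when predicting on the validation fold.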

undersampling home models... 

    Gradient Boosting with tomeklinks fbeta score: 0.589236,   std: 0.009513

    Logistic Regression with near miss ver3 fbeta score: 0.592371,    std: 0.008272
    
    Gaussian NB with near miss ver3 fbeta score: 0.576640,    std: 0.010239
    
    XGBoost with instance hardness threshold fbeta score: 0.589392,   std: 0.008605

undersampling away models....

    XGBoost with one sided selection fbeta score: 0.655338,    std: 0.004663
    
    Gradient Boosting with tomeklinks fbeta score: 0.653820,    std: 0.006291
    
    AdaBoost with tomeklinks fbeta score: 0.653018,    std: 0.006094
    
    Light GBM with one sided selection fbeta score: 0.651019,    std: 0.006110


oversampling home models... 

    Gradient Boosting with svm smote fbeta score: 0.590869,   std: 0.009334

    Logistic Regression with kmeans smote fbeta score: 0.591988,    std: 0.010471
    
    Gaussian NB with kmeans smote fbeta score: 0.581018,    std: 0.006867
    
    XGBoost with borderline smote fbeta score: 0.592335,   std: 0.008925

oversampling away models....

    XGBoost with svm smote fbeta score:  0.661319    std: 0.005484
    
    Gradient Boosting with adasyn fbeta score: 0.662989    std: 0.006634
    
    AdaBoost with svm smote fbeta score: 0.657282   std: 0.007692
    
    Light GBM with svm smote fbeta score: 0.659310,    std: 0.006528


models with the greatest increase in performance to proceed with... 

    Home models: Logistic Regression with near miss ver3 undersampling, XGBoost with borderline smote oversampling
    
    Away models: XGBoost with svm smote oversampling, Gradient Boosting with adasyn oversampling

# -- Hyperparameter Tuning --

## Home Models

In [None]:
# Home.. xgboost model hyperopt
xgb_model, xgb_feats, xgb_bsmote, xgb_name = home_xgb_setup()
home_train_target, _ = load_home_targets()
sc = StandardScaler(with_mean=False, with_std=True)
pipe = IMBPipeline(
    steps=[
        ('scaler', sc),
        ('os', xgb_bsmote),
        ('model', xgb_model)
    ]
)


def xgb_objective(params):
    pipe.set_params(**params)
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=1)
    metric = make_scorer(fbeta)
    scores = cross_val_score(pipe, xgb_feats, np.array(
        home_train_target).reshape(-1, ), scoring=metric, cv=cv, n_jobs=-1)
    avg_loss = 1 - np.mean(scores)
    return {'loss': avg_loss, 'params': params, 'status': STATUS_OK}


# define parameters
param_space = {
    'model__n_estimators': scope.int(hp.quniform('model__n_estimators', 100, 1500, 10)),
    'model__learning_rate': hp.uniform('model__learning_rate', 0.01, 1.0)
    # 'model__max_depth': scope.int(hp.quniform('model__max_depth', 3, 10, 1)),
    # 'model__min_child_weight': scope.int(hp.quniform('model__min_child_weight', 1, 12, 2)),
    # 'model__gamma': hp.uniform('model__gamma', 0.0, 0.7)
    # 'model__subsample': hp.uniform('model__subsample', 0.5, 0.9),
    # 'model__colsample_bytree': hp.uniform('model__colsample_bytree', 0.5, 0.9)
    # 'model__reg_alpha': hp.uniform('model__reg_alpha', 0.0001, 0.2),
    # 'model__reg_lambda': hp.uniform('model__reg_lambda', 0.0001, 0.2)
}

# optimize model params
trials = Trials()
best_params = fmin(fn=xgb_objective, space=param_space,
                   algo=tpe.suggest, max_evals=200, trials=trials)

best_param_space = space_eval(param_space, best_params)
print(f'Best Parameters:\n{best_param_space}')

In [None]:
""" best loss: 0.40742098389303627... after undersampling tuning
Best Parameters: {'model__max_depth': 3, 'model__min_child_weight': 8}  best loss: 0.40911686712965145
Best Parameters: {'model__gamma': 0.6936419289666603}  best loss: 0.40906388544775685
Best Parameters: {'model__colsample_bytree': 0.5920903811620744, 'model__subsample': 0.8374088758397853} best loss: 0.40730940803486004
Best Parameters: {'model__reg_alpha': 0.07425990654719038, 'model__reg_lambda': 0.16768800738594583} best loss: 0.40719825524837827
Best Parameters: {'model__learning_rate': 0.0488938809129075, 'model__n_estimators': 320} best loss: 0.4074975818679476
"""

In [None]:
# evaluate XGB tuned model
xgb_model, xgb_feats, xgb_bsmote, xgb_name = home_xgb_setup()
home_train_target, _ = load_home_targets()
sc = StandardScaler(with_mean=False, with_std=True)
pipe = IMBPipeline(
    steps=[
        ('scaler', sc),
        ('us', xgb_bsmote),
        ('model', xgb_model)
    ]
)
scores = evaluate_model(xgb_feats, np.array(
    home_train_target).reshape(-1,), pipe, k_splits=10)
print(f'Tuned XGB Avg Fbeta: {np.mean(scores)},    Std: {np.std(scores)}')

In [None]:
# logistic regression model hyperopt
lr_model, lr_feats, lr_pca, lr_nm3, lr_name = home_lr_setup()
home_train_target, _ = load_home_targets()
rs = RobustScaler(with_centering=False)
pipe = IMBPipeline(
    steps=[
        ('scaler', rs),
        ('us', lr_nm3),
        ('pca', lr_pca),
        ('model', lr_model)
    ]
)


def lr_objective(params):
    pipe.set_params(**params)
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=1)
    metric = make_scorer(fbeta)
    scores = cross_val_score(pipe, lr_feats, np.array(
        home_train_target).reshape(-1, ), scoring=metric, cv=cv, n_jobs=1)
    #scores = evaluate_model(lr_feats, np.array(home_train_target).reshape(-1, ), pipe, k_splits = 10)
    avg_loss = 1 - np.mean(scores)
    return {'loss': avg_loss, 'params': params, 'status': STATUS_OK}


# define parameters
param_space = {
    # 'model__penalty': hp.choice('model__penalty',
    #                           [{'model__penalty': hp.choice('l1_penalty', ['l1']),
    #                                 'model__solver': hp.choice('l1_solver', ['saga', 'liblinear'])},
    #                            {'model__penalty': hp.choice('l2_penalty', ['l2']),
    #                                 'model__solver': hp.choice('l2_solver', ['newton-cg', 'lbfgs', 'sag', 'saga', 'liblinear'])},
    #                            {'model__penalty': hp.choice('none_penalty', ['none']),
    #                                 'model__solver': hp.choice('none_solver', ['newton-cg','lbfgs', 'sag', 'saga'])},
    #                            {'model__penalty': hp.choice('enet_penalty', ['elasticnet']),
    #                                 'model__solver': hp.choice('sags_solver', ['saga'])}
    #                         ]),
    'model__penalty': 'l2',
    'model__solver':  hp.choice('model__solver', ['newton-cg', 'lbfgs', 'sag', 'saga', 'liblinear']),
    'model__C': hp.uniform('model__C', 0.1, 3.0),
    'model__max_iter': scope.int(hp.quniform('model__max_iter', 100, 300, 10)),
    # note: this samples tol between 1.0001 and 1.105, far looser than scikit-learn's default of 1e-4
    'model__tol': hp.loguniform('model__tol', np.log(1.0001), np.log(1.105)),
    'model__class_weight': hp.choice('model__class_weight', ['balanced', None])
}

# optimize model params
trials = Trials()
best_params = fmin(fn=lr_objective, space=param_space,
                   algo=tpe.suggest, max_evals=150, trials=trials)

best_param_space = space_eval(param_space, best_params)
print(f'Best Parameters:\n{best_param_space}')

In [None]:
"""
Best Parameters: {'scaler__with_centering': False, 'us__n_neighbors': 3, 'us__n_neighbors_ver3': 3}
        best loss: 0.40762876656556046
        
Best Parameters: ---- l2 penalty ----
{'model__C': 1.0516419703666957, 'model__class_weight': 'balanced', 'model__max_iter': 300, 'model__penalty': 'l2', 
'model__solver': 'liblinear', 'model__tol': 1.0316431882026003}
        best loss: 0.40744398053239705
        
Best Parameters: --- l1 penalty ----
{'model__C': 0.13916392773859043, 'model__class_weight': None, 'model__max_iter': 130, 'model__penalty': 'l1', 
'model__solver': 'liblinear', 'model__tol': 1.0683985965344678}
        best loss: 0.4071924589486443

Best Parameters: ---- none penalty ----
{'model__class_weight': None, 'model__max_iter': 110, 'model__penalty': 'none', 'model__solver': 'lbfgs', 
'model__tol': 1.0006085827691054}
        best loss: 0.40762871501378917
        
Best Parameters: ---- elasticnet penalty ----
{'model__C': 0.10648895061859351, 'model__class_weight': 'balanced', 'model__l1_ratio': 0.0017244126069944493, 
'model__max_iter': 200, 'model__penalty': 'elasticnet', 'model__solver': 'saga', 'model__tol': 1.019302605931096}
        best loss: 0.4113738956377505
"""

In [None]:
# home logistic regression model tuning evaluation
lr_model, lr_feats, lr_pca, lr_nm3, lr_name = home_lr_setup()
home_train_target, _ = load_home_targets()
rs = RobustScaler(with_centering=False)
pipe = IMBPipeline(
    steps=[
        ('scaler', rs),
        ('us', lr_nm3),
        ('pca', lr_pca),
        ('model', lr_model)
    ]
)
scores = evaluate_model(lr_feats, np.array(
    home_train_target).reshape(-1,), pipe, k_splits=10)
print(
    f'Home logistic regression after model tuning, Avg Fbeta: {np.mean(scores)},    Std: {np.std(scores)}')

In [None]:
# stacking classifier with both home models
train_features = load_train_features()
lr_model, lr_feats, lr_pca, lr_nm3, lr_name = home_lr_setup()
xgb_model, xgb_feats, xgb_bsmote, xgb_name = home_xgb_setup()
home_train_target, _ = load_home_targets()
rs = RobustScaler(with_centering=False)
sc = StandardScaler(with_mean=False, with_std=True)
#cv = RepeatedStratifiedKFold(n_splits = 10, n_repeats = 2, random_state = 1)
estims = [
    ('lr', IMBPipeline(
        steps=[
            ('trans', KeepColumnsTransformer(lr_feats.columns)),
            ('scaler', rs),
            ('us', lr_nm3),
            ('pca', lr_pca),
            ('model', lr_model)
        ])
     ),
    ('xgb', IMBPipeline(
        steps=[
            ('trans', KeepColumnsTransformer(xgb_feats.columns)),
            ('scaler', sc),
            ('os', xgb_bsmote),
            ('model', xgb_model)
        ])
     )
]

# named stack_clf so it does not shadow the StandardScaler bound to sc above
stack_clf = StackingClassifier(
    estimators=estims,
    final_estimator=GradientBoostingClassifier(
        n_estimators=150, random_state=42),
    cv=5,
    n_jobs=-1,
    verbose=2
)

scores = evaluate_model(train_features, np.array(
    home_train_target).reshape(-1,), stack_clf, k_splits=10)
print(
    f'Home LR/XGB models Stacking Classifier, Avg Fbeta: {np.mean(scores)},    Std: {np.std(scores)}')

In [None]:
# home model voting classifier
train_features = load_train_features()
lr_model, lr_feats, lr_pca, lr_nm3, lr_name = home_lr_setup()
xgb_model, xgb_feats, xgb_bsmote, xgb_name = home_xgb_setup()
home_train_target, _ = load_home_targets()
rs = RobustScaler(with_centering=False)
sc = StandardScaler(with_mean=False, with_std=True)

estims = [
    ('lr', IMBPipeline(
        steps=[
            ('trans', KeepColumnsTransformer(lr_feats.columns)),
            ('scaler', rs),
            ('us', lr_nm3),
            ('pca', lr_pca),
            ('model', lr_model)
        ])
     ),
    ('xgb', IMBPipeline(
        steps=[
            ('trans', KeepColumnsTransformer(xgb_feats.columns)),
            ('scaler', sc),
            ('os', xgb_bsmote),
            ('model', xgb_model)
        ])
     )
]
vc = VotingClassifier(
    estimators=estims,
    voting='hard',
    n_jobs=-1
)
scores = evaluate_model(train_features, np.array(
    home_train_target).reshape(-1,), vc, k_splits=10)
print(f'Home {lr_name}/{xgb_name} models Voting Classifier, Avg Fbeta: {np.mean(scores)},    Std: {np.std(scores)}')

## Away Models

In [None]:
# gradient boosting model hyperopt
train_features = load_train_features()
gb_pipe, _, gb_name = away_gb_setup()
away_train_target, _ = load_away_targets()


def gb_objective(params):
    gb_pipe.set_params(**params)
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=1)
    metric = make_scorer(fbeta)
    scores = cross_val_score(gb_pipe, train_features, np.array(
        away_train_target).reshape(-1, ), scoring=metric, cv=cv, n_jobs=-1)
    #scores = evaluate_model(lr_feats, np.array(home_train_target).reshape(-1, ), pipe, k_splits = 10)
    avg_loss = 1 - np.mean(scores)
    return {'loss': avg_loss, 'params': params, 'status': STATUS_OK}


# define parameters
param_space = {
    # hp.uniform('model__learning_rate', 0.001, 0.1),
    'model__learning_rate': 0.021886438886199792,
    # scope.int(hp.quniform('model__n_estimators', 100, 750, 10)),
    'model__n_estimators': 730,
    # hp.uniform('model__subsample', 0.1, 1.0),
    'model__subsample': 0.5976371247166854,
    # scope.int(hp.quniform('model__min_samples_split', 500, 2000, 25)),
    'model__min_samples_split': 1450,
    # scope.int(hp.quniform('model__min_samples_leaf', 50, 500, 10)),
    'model__min_samples_leaf': 480,
    # scope.int(hp.quniform('model__max_depth', 3, 11, 1)),
    'model__max_depth': 3,
    # scope.int(hp.quniform('model__max_features', 1, 8, 1)),
    'model__max_features': 7,
    'model__ccp_alpha': hp.uniform('model__ccp_alpha', 0.001, 0.1)
}

# optimize model params
trials = Trials()
best_params = fmin(fn=gb_objective, space=param_space,
                   algo=tpe.suggest, max_evals=50, trials=trials)

best_param_space = space_eval(param_space, best_params)
print(f'Best Parameters:\n{best_param_space}')

In [None]:
"""
Best Parameters: {'os__n_neighbors': 6, 'scaler__with_mean': False, 'scaler__with_std': True} best loss: 0.33664451412221863
Best Parameters: {'model__ccp_alpha': 1.1243946224171244, 'model__max_depth': 4.0, 'model__max_features': 8, 
                  'model__min_samples_leaf': 100, 'model__min_samples_split': 1000, 'model__subsample': 0.1967614244568458}
                  best loss: 0.38974766400850225
Best Parameters: {'model__max_depth': 3.0, 'model__max_features': 8, 'model__min_samples_leaf': 420, 
                  'model__min_samples_split': 675, 'model__subsample': 0.43918481213848615}
                   best loss: 0.33716110968295065
Best Parameters: {'model__learning_rate': 0.027115584500934167, 'model__n_estimators': 650} best loss: 0.334564047993395

##### after overfitting #####

Best Parameters:
{'model__ccp_alpha': 1.071038490980779, 'model__learning_rate': 0.045865681876250834, 'model__max_features': 4, 
 'model__n_estimators': 660, 'model__subsample': 0.7428739111521593}  best loss: 0.38974766400850225
 
Best Parameters:
{'model__learning_rate': 0.021886438886199792, 'model__max_depth': 3, 'model__max_features': 7, 
 'model__min_samples_leaf': 480, 'model__min_samples_split': 1450, 'model__n_estimators': 730, 
 'model__subsample': 0.5976371247166854}  best loss: 0.33505772349746543
"""

In [None]:
# evaluate gradient boosting tuning
gb_model, gb_feats, gb_adasyn, gb_name = away_gb_setup()
away_train_target, _ = load_away_targets()
sc = StandardScaler(with_mean=False, with_std=True)
pipe = IMBPipeline(
    steps=[
        ('scaler', sc),
        ('os', gb_adasyn),
        ('model', gb_model)
    ]
)

scores = evaluate_model(gb_feats, np.array(
    away_train_target).reshape(-1,), pipe, k_splits=10)
print(f'Tuned GB model Avg Fbeta: {np.mean(scores)},    Std: {np.std(scores)}')

In [None]:
# gradient boosting with bagging
gb_model, gb_feats, gb_adasyn, gb_name = away_gb_setup()
away_train_target, _ = load_away_targets()
sc = StandardScaler(with_mean=False, with_std=True)
bc = BaggingClassifier(
    base_estimator=gb_model,  # renamed to `estimator` in scikit-learn >= 1.2
    n_estimators=50,
    max_samples=0.8,
    max_features=0.7,
    bootstrap_features=True,
    n_jobs=-1,
    random_state=42
)
pipe = IMBPipeline(
    steps=[
        ('scaler', sc),
        ('os', gb_adasyn),
        ('model', bc)
    ]
)

scores = evaluate_model(gb_feats, np.array(
    away_train_target).reshape(-1,), pipe, k_splits=10)
print(
    f'Tuned {gb_name} model Avg Fbeta: {np.mean(scores)},    Std: {np.std(scores)}')

In [None]:
# gradient boosting with bagging: cross-validated train score and held-out test score
gb_pipe, _, gb_name = away_gb_setup()
train_features = load_train_features()
test_features = load_test_features()
away_train_target, away_test_target = load_away_targets()

# get predictions for cluster 1 & 3 features
test_cluster_feats = load_cluster_test_feats()
rf_cl1_model = joblib.load('cl1_rf_model.sav')
rf_pred = rf_cl1_model.predict(test_cluster_feats)
gb_cl3_model = joblib.load('cl3_gb_model.sav')
gb_pred = gb_cl3_model.predict(test_cluster_feats)
# load test features, add cluster 1 & 3 predictions, keep model features
test_features['cluster_1'] = rf_pred
test_features['cluster_3'] = gb_pred

gb_feats = train_features[['AHTGC5PHG', 'AATSOT5PG', 'AHTSOT5PHG', 'HA_AHTGS5PG_diff', 'AHT_GS_P5PG_ratio',
                           'AwayCapacityDiff_bin', 'cluster_1', 'cluster_3', 'HT_PrevSeasonPos_inv',
                           'AT_PrevSeasonPos_inv', 'bxcx_AHTGC5PG', 'bxcx_AATSOT5PG', 'bxcx_AHT_GS_P5PG_ratio']]

gb_model = GradientBoostingClassifier(n_estimators=650,
                                      learning_rate=0.027115584500934167,
                                      max_depth=3,
                                      max_features=8,
                                      min_samples_leaf=420,
                                      min_samples_split=675,
                                      subsample=0.43918481213848615,
                                      #ccp_alpha =  1.1243946224171244,
                                      random_state=42)
gb_adasyn = ADASYN(n_neighbors=6, random_state=1, n_jobs=-1)

bc = BaggingClassifier(
    base_estimator=gb_model,  # renamed to `estimator` in scikit-learn >= 1.2
    n_estimators=50,
    max_samples=0.8,
    max_features=0.7,
    bootstrap_features=True,
    n_jobs=-1,
    verbose=2,
    random_state=42
)
pipe = IMBPipeline(
    steps=[
        ('trans', KeepColumnsTransformer(gb_feats.columns)),
        ('scaler', StandardScaler(with_mean=False, with_std=True)),
        ('os', gb_adasyn),
        ('model', bc)
    ]
)

scores = evaluate_model(train_features, np.array(
    away_train_target).reshape(-1,), pipe, k_splits=10)
print(f'{gb_name} Train Avg Fbeta: {np.mean(scores)},    Std: {np.std(scores)}')

pipe.fit(train_features, np.array(away_train_target).reshape(-1,))
pred = pipe.predict(test_features)
print(f'Test Fbeta: {fbeta(away_test_target, pred)}')

In [None]:
# AWAY.. xgboost model hyperopt
xgb_model, xgb_feats, xgb_svms, xgb_name = away_xgb_setup()
away_train_target, _ = load_away_targets()
sc = StandardScaler(with_mean=False, with_std=True)
pipe = IMBPipeline(
    steps=[
        ('scaler', sc),
        ('os', xgb_svms),
        ('model', xgb_model)
    ]
)


def xgb_objective(params):
    pipe.set_params(**params)
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=1)
    metric = make_scorer(fbeta)
    scores = cross_val_score(pipe, xgb_feats, np.array(
        away_train_target).reshape(-1, ), scoring=metric, cv=cv, n_jobs=-1)
    #scores = evaluate_model(lr_feats, np.array(home_train_target).reshape(-1, ), pipe, k_splits = 10)
    avg_loss = 1 - np.mean(scores)
    return {'loss': avg_loss, 'params': params, 'status': STATUS_OK}


# define parameters
param_space = {
    'model__n_estimators': scope.int(hp.quniform('model__n_estimators', 200, 2000, 10)),
    'model__learning_rate': hp.uniform('model__learning_rate', 0.01, 0.1)
    # 'model__max_depth': scope.int(hp.quniform('model__max_depth', 3, 12, 1)),
    # 'model__min_child_weight': scope.int(hp.quniform('model__min_child_weight', 1, 12, 1)),
    # 'model__gamma': hp.uniform('model__gamma', 0.0, 0.8),
    # 'model__subsample': hp.uniform('model__subsample', 0.5, 1.0),
    # 'model__colsample_bytree': hp.uniform('model__colsample_bytree', 0.5, 0.9),
    # 'model__reg_alpha': hp.uniform('model__reg_alpha', 0.001, 0.2),
    # 'model__reg_lambda': hp.uniform('model__reg_lambda', 0.001, 0.2)
}

# optimize model params
trials = Trials()
best_params = fmin(fn=xgb_objective, space=param_space,
                   algo=tpe.suggest, max_evals=100, trials=trials)

best_param_space = space_eval(param_space, best_params)
print(f'Best Parameters:\n{best_param_space}')

In [None]:
"""
Best Parameters:
{'os__k_neighbors': 8, 'os__m_neighbors': 14, 'os__svm_estimator': SVC(), 'scaler__with_mean': False, 'scaler__with_std': True}
        best loss: 0.3404996349238907
Best Parameters:
{'model__gamma': 0.6757959901913699, 'model__max_depth': 3, 'model__min_child_weight': 10, 'model__subsample': 0.7962485851393263} 
        best loss: 0.3398373294482273
Best Parameters:
{'model__colsample_bytree': 0.5735095367483903, 'model__reg_alpha': 0.13029677939259385, 'model__reg_lambda': 0.10315636718690373}
        best loss: 0.3395668383422825
Best Parameters: {'model__learning_rate': 0.018277402480679054, 'model__n_estimators': 570}
        best loss: 0.3375532945064921
"""

In [None]:
# away XGB model tuning evaluation
xgb_model, xgb_feats, xgb_svms, xgb_name = away_xgb_setup()
away_train_target, _ = load_away_targets()
sc = StandardScaler(with_mean=False, with_std=True)
pipe = IMBPipeline(
    steps=[
        ('scaler', sc),
        ('os', xgb_svms),
        ('model', xgb_model)
    ]
)
scores = evaluate_model(xgb_feats, np.array(
    away_train_target).reshape(-1,), pipe, k_splits=10)
print(
    f'Away XGB model after tuning, Avg Fbeta: {np.mean(scores)},    Std: {np.std(scores)}')

In [None]:
# away models stacking classifier
train_features = load_train_features()
xgb_model, xgb_feats, xgb_svms, xgb_name = away_xgb_setup()
gb_model, gb_feats, gb_adasyn, gb_name = away_gb_setup()
away_train_target, _ = load_away_targets()
sc = StandardScaler(with_mean=False, with_std=True)
estims = [
    ('gb', IMBPipeline(
        steps=[
            ('trans', KeepColumnsTransformer(gb_feats.columns)),
            ('scaler', sc),
            ('os', gb_adasyn),
            ('model', gb_model)
        ])
     ),
    ('xgb', IMBPipeline(
        steps=[
            ('trans', KeepColumnsTransformer(xgb_feats.columns)),
            ('scaler', sc),
            ('os', xgb_svms),
            ('model', xgb_model)
        ])
     )
]

# named stack_clf so it does not shadow the StandardScaler bound to sc above
stack_clf = StackingClassifier(
    estimators=estims,
    cv=5,
    n_jobs=-1,
    verbose=2
)

scores = evaluate_model(train_features, np.array(
    away_train_target).reshape(-1,), stack_clf, k_splits=10)
print(f'Away {gb_name}/{xgb_name} models Stacking Classifier, Avg Fbeta: {np.mean(scores)},    Std: {np.std(scores)}')

In [None]:
# away models voting classifier
train_features = load_train_features()
xgb_model, xgb_feats, xgb_svms, xgb_name = away_xgb_setup()
gb_model, gb_feats, gb_adasyn, gb_name = away_gb_setup()
away_train_target, _ = load_away_targets()
sc = StandardScaler(with_mean=False, with_std=True)
estims = [
    ('gb', IMBPipeline(
        steps=[
            ('trans', KeepColumnsTransformer(gb_feats.columns)),
            ('scaler', sc),
            ('os', gb_adasyn),
            ('model', gb_model)
        ])
     ),
    ('xgb', IMBPipeline(
        steps=[
            ('trans', KeepColumnsTransformer(xgb_feats.columns)),
            ('scaler', sc),
            ('os', xgb_svms),
            ('model', xgb_model)
        ])
     )
]
vc = VotingClassifier(
    estimators=estims,
    voting='soft',
    n_jobs=1
)
scores = evaluate_model(train_features, np.array(
    away_train_target).reshape(-1,), vc, k_splits=10)
print(f'Away {gb_name}/{xgb_name} models Voting Classifier, Avg Fbeta: {np.mean(scores)},    Std: {np.std(scores)}')

In [None]:
# away LightGBM model hyperopt
train_features = load_train_features()
lgbm_pipe, lgbm_name = away_lgbm_setup()
away_train_target, _ = load_away_targets()


def lgbm_objective(params):
    lgbm_pipe.set_params(**params)
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=1)
    metric = make_scorer(fbeta)
    scores = cross_val_score(lgbm_pipe, train_features, np.array(
        away_train_target).reshape(-1, ), scoring=metric, cv=cv, n_jobs=-1)
    #scores = evaluate_model(lr_feats, np.array(home_train_target).reshape(-1, ), pipe, k_splits = 10)
    avg_loss = 1 - np.mean(scores)
    return {'loss': avg_loss, 'params': params, 'status': STATUS_OK}


# define parameters
param_space = {
    'model__learning_rate': hp.uniform('model__learning_rate', 0.001, 0.05),
    'model__n_estimators': scope.int(hp.quniform('model__n_estimators', 100, 750, 10)),
    # scope.int(hp.quniform('model__num_leaves', 30, 90, 5)),
    'model__num_leaves': 35,
    # scope.int(hp.quniform('model__min_child_samples', 100, 1500, 10)),
    'model__min_child_samples': 1320,
    # hp.uniform('model__colsample_bytree', 0.5, 1.0),
    'model__colsample_bytree': 0.7787949835735254,
    # hp.uniform('model__subsample', 0.5, 1.0),
    'model__subsample': 0.8073575692094029,
    # hp.choice('model__is_unbalance', [True, False]),
    'model__is_unbalance': True,
    # scope.int(hp.quniform('model__max_depth', 3, 11, 1)),
    'model__max_depth': 4,
    # 'model__max_bin': scope.int(hp.quniform('model__max_bin', 255, 300, 5)),
    # hp.uniform('model__reg_alpha', 0.0, 0.1),
    'model__reg_alpha': 0.06951672237498768,
    # hp.uniform('model__reg_lambda', 0.0, 0.1)
    'model__reg_lambda': 0.03784851967467663,
}

# optimize model params
trials = Trials()
best_params = fmin(fn=lgbm_objective, space=param_space,
                   algo=tpe.suggest, max_evals=75, trials=trials)

best_param_space = space_eval(param_space, best_params)
print(f'Best Parameters:\n{best_param_space}')

In [None]:
# evaluate tuned away LightGBM model
train_features = load_train_features()
lgbm_pipe, lgbm_name = away_lgbm_setup()
away_train_target, _ = load_away_targets()

scores = evaluate_model(train_features, np.array(
    away_train_target).reshape(-1,), lgbm_pipe, k_splits=10)
print(f'{lgbm_name} Avg Fbeta: {np.mean(scores)},   Std: {np.std(scores)}')

In [None]:
"""
Best Parameters:  best loss: 0.3414808670872569
{'os__k_neighbors': 3, 'os__m_neighbors': 7, 'scaler__with_mean': True, 'scaler__with_std': True}
Best Parameters:
{'model__colsample_bytree': 0.7787949835735254, 'model__is_unbalance': True, 'model__max_depth': 4, 
 'model__min_child_samples': 1320, 'model__num_leaves': 35, 'model__reg_alpha': 0.06951672237498768, 
 'model__reg_lambda': 0.03784851967467663, 'model__subsample': 0.8073575692094029}
 best loss: 0.33581922981611856
Best Parameters:
{'model__colsample_bytree': 0.7787949835735254, 'model__is_unbalance': True, 'model__learning_rate': 0.04250057994580665, 
 'model__max_depth': 4, 'model__min_child_samples': 1320, 'model__n_estimators': 470, 'model__num_leaves': 35, 
 'model__reg_alpha': 0.06951672237498768, 'model__reg_lambda': 0.03784851967467663, 'model__subsample': 0.8073575692094029}
 best loss: 0.3362585825540674
"""

## - Summary -

All model hyperparameter tuning was implemented using Bayesian optimisation (hyperopt with the TPE algorithm).
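
Each objective function in the tuning cells returns 1 minus the mean cross-validated fbeta score, because hyperopt minimises a loss. The sketch below illustrates that conversion under stated assumptions: the notebook's own `fbeta` helper is defined elsewhere, so the function names and the `beta=1.0` default here are hypothetical and used for illustration only.

```python
from statistics import mean


def fbeta_from_counts(tp, fp, fn, beta=1.0):
    """F-beta from confusion counts; beta > 1 favours recall, beta < 1 favours precision."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def loss_from_fold_scores(fold_scores):
    """Turn per-fold F-beta scores into a loss to minimise (lower is better)."""
    return 1.0 - mean(fold_scores)


# hypothetical fold scores, for illustration only
fold_scores = [0.60, 0.58, 0.61]
print(loss_from_fold_scores(fold_scores))
```

Read this way, a "best loss" of about 0.407 in the tuning notes corresponds to a mean fbeta of about 0.593.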

Gradient boosting tuning covered tree-specific parameters and boosting parameters:

    min_samples_split: the minimum number of samples (observations) a node must contain to be considered for splitting. Helps control overfitting.
    min_samples_leaf: the minimum number of samples (observations) required in a terminal node (leaf). Helps control overfitting.
    max_depth: the maximum depth of a tree.
    max_features: the number of features to consider when searching for the best split.
    
    learning_rate: the contribution of each tree to the final outcome. Used in conjunction with n_estimators; lower values usually generalise better but need more trees.
    n_estimators: the number of sequential trees to fit. Used in conjunction with learning_rate.
    subsample: the fraction of observations sampled for each tree.
    
XGBoost tuning covered the following boosting and tree parameters:

    learning_rate: the contribution of each tree to the final outcome. Used in conjunction with n_estimators; makes the model more robust.
    n_estimators: the number of boosting trees. Used in conjunction with learning_rate: a lower learning_rate needs a higher n_estimators.
    min_child_weight: the minimum sum of weights of all observations required in a child. Similar to min_samples_leaf in gradient boosting; helps control overfitting.
    max_depth: the maximum depth of a tree.
    gamma: the minimum loss reduction required to make a split.
    subsample: same as for gradient boosting.
    colsample_bytree: the fraction of columns randomly sampled for each tree. Similar to max_features in gradient boosting.
    reg_lambda: L2 regularisation on weights. Helps reduce overfitting.
    reg_alpha: L1 regularisation on weights.
    
Logistic Regression has few critical hyperparameters to tune, though some can aid performance on the data being modelled:

    penalty: the norm used in the penalisation.
    solver: the optimisation algorithm. Although particular solvers suit particular types of data, all solvers were included in the search space.
    max_iter: the maximum number of iterations allowed for the solver to converge.
    C: the inverse of the regularisation strength; smaller values mean a stronger penalty.
    tol: the tolerance for the stopping criterion.
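
In scikit-learn, C is the inverse of the regularisation strength, so a smaller C penalises the coefficients more heavily. The sketch below shows the effect on synthetic data. The dataset and the three C values are assumptions chosen for illustration and are not taken from the football features or the tuned search space.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# synthetic data, for illustration only (not the football features)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

coef_norms = {}
for C in (0.01, 1.0, 100.0):
    # default penalty is L2; only C changes between fits
    model = LogisticRegression(C=C, solver='lbfgs', max_iter=1000)
    model.fit(X, y)
    coef_norms[C] = float(np.linalg.norm(model.coef_))

# smaller C should give a smaller coefficient norm (stronger shrinkage)
print(coef_norms)
```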
    
LightGBM tuning covered many of the same parameters listed for XGBoost and gradient boosting. The main difference is num_leaves, the maximum number of leaves for each base learner.

The individual models were also combined using the voting and stacking ensemble methods to see whether performance could be gained.
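
The home voting classifier above uses hard voting and the away one uses soft voting. The sketch below shows how the two rules can disagree on the same set of predicted win probabilities. The helper names and the probabilities are hypothetical, and the tie rule (ties go to class 0) is an assumption of this sketch.

```python
def hard_vote(probas, threshold=0.5):
    """Majority vote over each model's thresholded class label; ties go to class 0."""
    votes = [1 if p >= threshold else 0 for p in probas]
    return 1 if sum(votes) * 2 > len(votes) else 0


def soft_vote(probas, threshold=0.5):
    """Average the win probabilities across models, then threshold the mean."""
    mean_p = sum(probas) / len(probas)
    return (1 if mean_p >= threshold else 0), mean_p


# hypothetical win probabilities from three models for one match
probas = [0.90, 0.40, 0.45]
print(hard_vote(probas))   # two of three models vote "no win"
print(soft_vote(probas))   # one confident model lifts the average above 0.5
```

Soft voting keeps an averaged probability for each match, which matters here because the project ranks teams by win probability. scikit-learn's `VotingClassifier` does not offer `predict_proba` when `voting='hard'`.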

Home model results:

    Logistic Regression model, Avg Fbeta: 0.5928075410513557,    Std: 0.008289114614265857
    
    XGBClassifier model, Avg Fbeta: 0.5925024181320524,    Std: 0.00791501842121371
   
    Stacked Classifier (LR/XGB) model, Avg Fbeta: 0.5836628041152927,    Std: 0.010158813335395256
    
    Voting Classifier (LR/XGB) model, Avg Fbeta: 0.5924814014193714,    Std: 0.008173213617041583
    
Away model results:

    Gradient Boosting model, Avg Fbeta: 0.665435952006605,    Std: 0.00779246295616173
    
    XGBClassifier model, Avg Fbeta: 0.6624467054935079,    Std: 0.007283174908768594
    
    LGBMClassifier model, Avg Fbeta: 0.6637414174459326,   Std: 0.008821754261260821
    
    Stacked Classifier (GB/XGB) model, Avg Fbeta: 0.6389081116604908,    Std: 0.004767996936439064
    
    Voting Classifier (GB/XGB) model, Avg Fbeta: 0.665915932158899,    Std: 0.00686345949936166
    
Home Models:   the Logistic Regression and XGBClassifier models performed best on their own

Away Models:   Gradient Boosting, XGBClassifier and their combination in the voting classifier performed best

As can be seen, all models have improved fbeta scores.
The next step is to test the models for generalisation, including a weighted average of the standalone models.

##Update##:
The gradient boosting model was found to overfit. It was retuned with the ccp_alpha parameter included, which reduces overfitting by pruning the trees. This corrected the overfitting and gave good generalisation, but the model no longer performs as well as the others, so the LightGBM model now takes its place.
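
To illustrate what ccp_alpha does, the sketch below fits single decision trees with increasing ccp_alpha and counts their nodes. A single `DecisionTreeClassifier` is used so the pruning is easy to see; `GradientBoostingClassifier` exposes the same parameter for its individual trees. The synthetic data and the alpha values are assumptions for illustration, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# noisy synthetic data, for illustration only (not the football features)
X, y = make_classification(n_samples=1000, n_features=10,
                           flip_y=0.1, random_state=0)

node_counts = {}
for alpha in (0.0, 0.005, 0.05):
    # larger ccp_alpha removes more of the weakest subtrees after the tree is grown
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X, y)
    node_counts[alpha] = tree.tree_.node_count

print(node_counts)
```

A large ccp_alpha can prune a tree down to very few nodes, so the search range for this parameter needs care.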

# -- Model Calibration --

## Home Models

In [None]:
# logistic regression calibration
train_features = load_train_features()
test_features = load_test_features()
home_train_target, home_test_target = load_home_targets()
# get predictions for cluster 1 & 3 features
test_cluster_feats = load_cluster_test_feats()
rf_cl1_model = joblib.load('cl1_rf_model.sav')
rf_pred = rf_cl1_model.predict(test_cluster_feats)
gb_cl3_model = joblib.load('cl3_gb_model.sav')
gb_pred = gb_cl3_model.predict(test_cluster_feats)
# add cluster 1 & 3 predictions
test_features['cluster_1'] = rf_pred
test_features['cluster_3'] = gb_pred

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=1)


lr_model = LogisticRegression(
    penalty='l1',
    solver='liblinear',
    C=0.13916392773859043,
    max_iter=130,
    tol=1.0683985965344678,
    class_weight=None,
    random_state=42
)

lr_feats = ['AHTGS5PG_UPoutlier', 'AHTGS5PHG_UPoutlier', 'AATGS5PAG_UPoutlier', 'AHTSOT5PG_UPoutlier',
            'AATSOT5PG_UPoutlier', 'AHTSOT5PHG_UPoutlier', 'AHTGD5PHG_UPoutlier', 'HT_PrevSeasonPos_inv',
            'AT_PrevSeasonPos_inv', 'HA_AHTGS5PG_diff_lowqrt', 'AHTGS5PG_upqrt_AATGC5PG_lowqrt',
            'AHTSOT5PG_bxcx_pwrTRANSFORM', 'AATSOT5PG_bxcx_pwrTRANSFORM', 'AHTGD5PG_quantileTRANSFORM',
            'AATGD5PG_quantileTRANSFORM', 'AwayCapacityDiff_bin_quantileTRANSFORM',
            'bxcx_AHTSOT5PG_quantileTRANSFORM', 'bxcx_AATSOT5PG_quantileTRANSFORM',
            'bxcx_AAT_GS_P5PG_ratio_quantileTRANSFORM']

lr_pipe, _, _, lr_name = home_lr_setup()

lr_iso_cal = IMBPipeline(
    steps=[
        ('transformer', KeepColumnsTransformer(lr_feats)),
        ('scaler', RobustScaler(with_centering=False)),
        ('sampling', NearMiss(version=3, n_neighbors=3, n_neighbors_ver3=3, n_jobs=-1)),
        ('pca', PCA(n_components=12, whiten=True)),
        ('model', CalibratedClassifierCV(lr_model, cv=cv, method='isotonic'))
    ]
)


lr_sig_cal = IMBPipeline(
    steps=[
        ('transformer', KeepColumnsTransformer(lr_feats)),
        ('scaler', RobustScaler(with_centering=False)),
        ('sampling', NearMiss(version=3, n_neighbors=3, n_neighbors_ver3=3, n_jobs=-1)),
        ('pca', PCA(n_components=12, whiten=True)),
        ('model', CalibratedClassifierCV(lr_model, cv=cv, method='sigmoid'))
    ]
)


pipes = [(lr_pipe, lr_name), (lr_iso_cal, lr_name + ' isotonic'),
         (lr_sig_cal, lr_name + ' sigmoid')]

# plot calibration curve and histogram of predicted positive probabilities
plt.figure(figsize=(10, 10))
ax1 = plt.subplot2grid((3, 1), (0, 0), rowspan=2)
ax2 = plt.subplot2grid((3, 1), (2, 0))
ax1.plot([0, 1], [0, 1], linestyle='--',
         label='Perfect Calibration', color='gray')
for clf, name in pipes:
    print(f'Processing: {name}')
    clf.fit(train_features, np.array(home_train_target).reshape(-1,))
    y_pred = clf.predict(test_features)
    fb_score = fbeta(np.array(home_test_target).reshape(-1,), y_pred)
    probs = clf.predict_proba(test_features)[:, 1]
    fov, mpv = calibration_curve(
        np.array(home_test_target).reshape(-1,), probs, n_bins=10)
    ax1.plot(mpv, fov, marker='.', label=f'{name} fb:{fb_score}')
    ax2.hist(probs, range=(0, 1), bins=10, label=name, histtype='step', lw=2)
ax1.set_title('Calibration (Reliability) Curve of Logistic Regression Model')
ax1.set_ylabel('Fraction of Positives')
ax1.set_ylim([-0.05, 1.05])
ax1.legend(loc='lower right')
ax1.grid(alpha=0.3)

ax2.set_xlabel('Mean Predicted Value')
ax2.set_ylabel('Count')
ax2.legend(loc='upper right', ncol=2)
plt.tight_layout()
plt.show()
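
The reliability curve plotted above compares, per probability bin, the mean predicted value with the observed fraction of positives. The sketch below is a simplified pure-Python version of that binning, assuming equal-width bins. It is not scikit-learn's implementation, and the function name and sample data are hypothetical.

```python
def reliability_curve(y_true, y_prob, n_bins=10):
    """Return (fraction_of_positives, mean_predicted_value) for each non-empty bin."""
    prob_sums = [0.0] * n_bins   # sum of predicted probabilities per bin
    positives = [0] * n_bins     # number of actual positives per bin
    counts = [0] * n_bins        # number of samples per bin
    for label, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        prob_sums[idx] += p
        positives[idx] += label
        counts[idx] += 1
    used = [i for i in range(n_bins) if counts[i] > 0]
    fop = [positives[i] / counts[i] for i in used]
    mpv = [prob_sums[i] / counts[i] for i in used]
    return fop, mpv


# hypothetical labels and predicted win probabilities
fop, mpv = reliability_curve([0, 0, 1, 1, 0], [0.1, 0.2, 0.8, 0.9, 0.85], n_bins=2)
print(fop, mpv)
```

A well-calibrated model has points close to the diagonal, meaning the fraction of positives in each bin is close to that bin's mean predicted value.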

In [None]:
# xgb classifier calibration
train_features = load_train_features()
test_features = load_test_features()
home_train_target, home_test_target = load_home_targets()
# get predictions for cluster 1 & 3 features
test_cluster_feats = load_cluster_test_feats()
rf_cl1_model = joblib.load('cl1_rf_model.sav')
rf_pred = rf_cl1_model.predict(test_cluster_feats)
gb_cl3_model = joblib.load('cl3_gb_model.sav')
gb_pred = gb_cl3_model.predict(test_cluster_feats)
# add cluster 1 & 3 predictions
test_features['cluster_1'] = rf_pred
test_features['cluster_3'] = gb_pred

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=1)


xgb_model = XGBClassifier(n_estimators=320,
                          learning_rate=0.0488938809129075,
                          max_depth=3,
                          min_child_weight=8,
                          gamma=0.6936419289666603,
                          scale_pos_weight=1,
                          colsample_bytree=0.5920903811620744,
                          subsample=0.8374088758397853,
                          reg_alpha=0.07425990654719038,
                          reg_lambda=0.16768800738594583,
                          use_label_encoder=False,
                          n_jobs=-1,
                          verbosity=0,
                          random_state=42)

xgb_feats = ['AHTSOT5PG', 'AATSOT5PG', 'AATGD5PAG', 'season_month_sin', 'season_month_cos',
             'HA_ATP5PG_diff', 'AwayCapacityDiff_bin', 'cluster_3', 'HT_PrevSeasonPos_inv',
             'AT_PrevSeasonPos_inv', 'HT_GSGC_UPLOW_QRT', 'AHTGC5PG_upqrt_AATGC5PG_lowqrt']

xgb_pipe, _, _, xgb_name = home_xgb_setup()

xgb_iso_cal = IMBPipeline(
    steps=[
        ('transformer', KeepColumnsTransformer(xgb_feats)),
        ('scaler', StandardScaler(with_mean=False, with_std=True)),
        ('sampling', BorderlineSMOTE(k_neighbors=12, m_neighbors=20,
                                     kind='borderline-1', random_state=1, n_jobs=-1)),
        ('model', CalibratedClassifierCV(xgb_model, cv=cv, method='isotonic'))
    ]
)


xgb_sig_cal = IMBPipeline(
    steps=[
        ('transformer', KeepColumnsTransformer(xgb_feats)),
        ('scaler', StandardScaler(with_mean=False, with_std=True)),
        ('sampling', BorderlineSMOTE(k_neighbors=12, m_neighbors=20,
                                     kind='borderline-1', random_state=1, n_jobs=-1)),
        ('model', CalibratedClassifierCV(xgb_model, cv=cv, method='sigmoid'))
    ]
)


pipes = [(xgb_pipe, xgb_name), (xgb_iso_cal, xgb_name + ' isotonic'),
         (xgb_sig_cal, xgb_name + ' sigmoid')]

# plot calibration curve and histogram of predicted positive probabilities
plt.figure(figsize=(10, 10))
ax1 = plt.subplot2grid((3, 1), (0, 0), rowspan=2)
ax2 = plt.subplot2grid((3, 1), (2, 0))
ax1.plot([0, 1], [0, 1], linestyle='--',
         label='Perfect Calibration', color='gray')
for clf, name in pipes:
    print(f'Processing: {name}')
    clf.fit(train_features, np.array(home_train_target).reshape(-1,))
    y_pred = clf.predict(test_features)
    fb_score = fbeta(np.array(home_test_target).reshape(-1,), y_pred)
    probs = clf.predict_proba(test_features)[:, 1]
    fov, mpv = calibration_curve(
        np.array(home_test_target).reshape(-1,), probs, n_bins=10)
    ax1.plot(mpv, fov, marker='.', label=f'{name} fb:{fb_score}')
    ax2.hist(probs, range=(0, 1), bins=10, label=name, histtype='step', lw=2)
ax1.set_title('Calibration (Reliability) Curve of XGB Classifier Model')
ax1.set_ylabel('Fraction of Positives')
ax1.set_ylim([-0.05, 1.05])
ax1.legend(loc='lower right')
ax1.grid(alpha=0.3)

ax2.set_xlabel('Mean Predicted Value')
ax2.set_ylabel('Count')
ax2.legend(loc='upper right', ncol=2)
plt.tight_layout()
plt.show()

In [None]:
# home model, voting classifier with calibrated models
train_features = load_train_features()
# the setup helpers return four values (pipe, calibrated pipe, test features, name)
xgb_pipe, _, _, xgb_name = home_xgb_setup()
lr_pipe, _, _, lr_name = home_lr_setup()
home_train_target, _ = load_home_targets()

vc = VotingClassifier(
    estimators=[('xgb', xgb_pipe), ('lr', lr_pipe)],
    voting='soft',
    n_jobs=1
)
scores = evaluate_model(train_features, np.array(
    home_train_target).reshape(-1,), vc, k_splits=10)
print(f'Home {xgb_name}/{lr_name} models Voting Classifier, Avg Fbeta: {np.mean(scores)}, Std: {np.std(scores)}')
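
The voting classifier above is built with `voting='soft'`, which means the class is chosen from the averaged predicted probabilities of the member models rather than from a majority of their hard labels. The following is a minimal plain-Python sketch of that idea for the binary case. The function name `soft_vote` and the example probabilities are made up for illustration and are not part of the notebook.

```python
def soft_vote(prob_lists, weights=None):
    """Average class-1 probabilities across models (hypothetical helper).

    prob_lists: one list of class-1 probabilities per model, all the same length.
    weights: optional per-model weights; equal weights are assumed if omitted.
    """
    n_models = len(prob_lists)
    if weights is None:
        weights = [1.0 / n_models] * n_models
    total = sum(weights)
    averaged = [
        sum(w * probs[i] for w, probs in zip(weights, prob_lists)) / total
        for i in range(len(prob_lists[0]))
    ]
    # in the binary case, class 1 wins when its averaged probability exceeds 0.5
    labels = [1 if p > 0.5 else 0 for p in averaged]
    return averaged, labels


# illustrative numbers only
xgb_probs = [0.70, 0.40, 0.55]
lr_probs = [0.60, 0.30, 0.35]
avg_probs, voted_labels = soft_vote([xgb_probs, lr_probs])
print(avg_probs, voted_labels)
```

In the third match the two models disagree (0.55 vs 0.35), and the averaged probability of 0.45 falls below 0.5, so the vote goes to class 0.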

## Away Models

In [None]:
# lgbm calibration
train_features = load_train_features()
test_features = load_test_features()
away_train_target, away_test_target = load_away_targets()

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=1)

lgbm_model = LGBMClassifier(n_estimators=470,
                            learning_rate=0.04250057994580665,
                            objective='binary',
                            max_depth=4,
                            num_leaves=35,
                            min_child_samples=1320,
                            colsample_bytree=0.7787949835735254,
                            subsample=0.8073575692094029,
                            reg_alpha=0.06951672237498768,
                            reg_lambda=0.03784851967467663,
                            is_unbalance=True,
                            n_jobs=-1,
                            random_state=42)

lgbm_feats = ['AHTGS5PG', 'AHTGC5PG', 'AHTSOT5PHG', 'AATSOT5PAG', 'AHTP5PHG', 'AHTGD5PG',
              'HA_AHTGS5PG_diff', 'HA_ATP5PG_diff', 'AHT_GS_P5PG_ratio', 'AwayCapacityDiff_bin',
              'HT_PrevSeasonPos_inv', 'AT_PrevSeasonPos_inv']

lgbm_pipe, _, _, lgbm_name = away_lgbm_setup()

lgbm_iso_cal = IMBPipeline(
    steps=[
        ('transformer', KeepColumnsTransformer(lgbm_feats)),
        ('scaler', StandardScaler(with_mean=True, with_std=True)),
        ('sampling', SVMSMOTE(k_neighbors=3, m_neighbors=7, random_state=1, n_jobs=-1)),
        ('model', CalibratedClassifierCV(lgbm_model, cv=cv, method='isotonic'))
    ]
)


lgbm_sig_cal = IMBPipeline(
    steps=[
        ('transformer', KeepColumnsTransformer(lgbm_feats)),
        ('scaler', StandardScaler(with_mean=True, with_std=True)),
        ('sampling', SVMSMOTE(k_neighbors=3, m_neighbors=7, random_state=1, n_jobs=-1)),
        ('model', CalibratedClassifierCV(lgbm_model, cv=cv, method='sigmoid'))
    ]
)

pipes = [(lgbm_pipe, lgbm_name), (lgbm_iso_cal, lgbm_name +
                                  ' isotonic'), (lgbm_sig_cal, lgbm_name + ' sigmoid')]
# plot calibration curve and histogram of predicted positive probabilities
plt.figure(figsize=(10, 10))
ax1 = plt.subplot2grid((3, 1), (0, 0), rowspan=2)
ax2 = plt.subplot2grid((3, 1), (2, 0))
ax1.plot([0, 1], [0, 1], linestyle='--',
         label='Perfect Calibration', color='gray')
for clf, name in pipes:
    print(f'Processing {name}')
    clf.fit(train_features, np.array(away_train_target).reshape(-1,))
    y_pred = clf.predict(test_features)
    fb_score = fbeta(away_test_target, y_pred)
    probs = clf.predict_proba(test_features)[:, 1]
    fov, mpv = calibration_curve(
        np.array(away_test_target).reshape(-1,), probs, n_bins=10)
    ax1.plot(mpv, fov, marker='.', label=f'{name} fb:{fb_score}')
    ax2.hist(probs, range=(0, 1), bins=10, label=name, histtype='step', lw=2)
ax1.set_title('Calibration (Reliability) Curve of Away LGBM Model')
ax1.set_ylabel('Fraction of Positives')
ax1.set_ylim([-0.05, 1.05])
ax1.legend(loc='lower right')
ax1.grid(alpha=0.3)

ax2.set_xlabel('Mean Predicted Value')
ax2.set_ylabel('Count')
ax2.legend(loc='upper right', ncol=2)
plt.tight_layout()
plt.show()

In [None]:
# away xgb calibration
train_features = load_train_features()
test_features = load_test_features()
away_train_target, away_test_target = load_away_targets()

# get predictions for the cluster 1 feature
test_cluster_feats = load_cluster_test_feats()
rf_cl1_model = joblib.load('cl1_rf_model.sav')
rf_pred = rf_cl1_model.predict(test_cluster_feats)
# add cluster 1 predictions
test_features['cluster_1'] = rf_pred

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=1)

xgb_model = XGBClassifier(n_estimators=570,
                          learning_rate=0.018277402480679054,
                          max_depth=3,
                          min_child_weight=10,
                          gamma=0.6757959901913699,
                          subsample=0.7962485851393263,
                          colsample_bytree=0.5735095367483903,
                          scale_pos_weight=1,
                          reg_alpha=0.13029677939259385,
                          reg_lambda=0.10315636718690373,
                          use_label_encoder=False,
                          n_jobs=-1,
                          verbosity=0,
                          random_state=42)

xgb_feats = ['AHTGC5PG', 'AATP5PAG', 'DayofWeek', 'HA_ATP5PG_diff', 'cluster_1', 'HT_PrevSeasonPos_inv',
             'AT_PrevSeasonPos_inv', 'HA_AHTGS5PG_diff_upqrt', 'AHTGC5PG_upqrt_AATGC5PG_lowqrt']

xgb_pipe, _, _, xgb_name = away_xgb_setup()

xgb_iso_cal = IMBPipeline(
    steps=[
        ('transformer', KeepColumnsTransformer(xgb_feats)),
        ('scaler', StandardScaler(with_mean=False, with_std=True)),
        ('sampling', SVMSMOTE(k_neighbors=8, m_neighbors=14,
                              svm_estimator=SVC(), random_state=1, n_jobs=-1)),
        ('model', CalibratedClassifierCV(xgb_model, cv=cv, method='isotonic'))
    ]
)


xgb_sig_cal = IMBPipeline(
    steps=[
        ('transformer', KeepColumnsTransformer(xgb_feats)),
        ('scaler', StandardScaler(with_mean=False, with_std=True)),
        ('sampling', SVMSMOTE(k_neighbors=8, m_neighbors=14,
                              svm_estimator=SVC(), random_state=1, n_jobs=-1)),
        ('model', CalibratedClassifierCV(xgb_model, cv=cv, method='sigmoid'))
    ]
)

pipes = [(xgb_pipe, xgb_name), (xgb_iso_cal, xgb_name + ' isotonic'),
         (xgb_sig_cal, xgb_name + ' sigmoid')]
# plot calibration curve and histogram of predicted positive probabilities
plt.figure(figsize=(10, 10))
ax1 = plt.subplot2grid((3, 1), (0, 0), rowspan=2)
ax2 = plt.subplot2grid((3, 1), (2, 0))
ax1.plot([0, 1], [0, 1], linestyle='--',
         label='Perfect Calibration', color='gray')
for clf, name in pipes:
    print(f'Processing {name}')
    clf.fit(train_features, np.array(away_train_target).reshape(-1,))
    y_pred = clf.predict(test_features)
    fb_score = fbeta(away_test_target, y_pred)
    probs = clf.predict_proba(test_features)[:, 1]
    fov, mpv = calibration_curve(
        np.array(away_test_target).reshape(-1,), probs, n_bins=10)
    ax1.plot(mpv, fov, marker='.', label=f'{name} fb:{fb_score}')
    ax2.hist(probs, range=(0, 1), bins=10, label=name, histtype='step', lw=2)
ax1.set_title('Calibration (Reliability) Curve of Away XGB Model')
ax1.set_ylabel('Fraction of Positives')
ax1.set_ylim([-0.05, 1.05])
ax1.legend(loc='lower right')
ax1.grid(alpha=0.3)

ax2.set_xlabel('Mean Predicted Value')
ax2.set_ylabel('Count')
ax2.legend(loc='upper right', ncol=2)
plt.tight_layout()
plt.show()

In [None]:
# away gb calibration
train_features = load_train_features()
test_features = load_test_features()
away_train_target, away_test_target = load_away_targets()

# get predictions for cluster 1 & 3 features
test_cluster_feats = load_cluster_test_feats()
rf_cl1_model = joblib.load('cl1_rf_model.sav')
rf_pred = rf_cl1_model.predict(test_cluster_feats)
gb_cl3_model = joblib.load('cl3_gb_model.sav')
gb_pred = gb_cl3_model.predict(test_cluster_feats)
# add cluster 1 & 3 predictions
test_features['cluster_1'] = rf_pred
test_features['cluster_3'] = gb_pred

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=1)

gb_model = GradientBoostingClassifier(n_estimators=650,
                                      learning_rate=0.027115584500934167,
                                      max_depth=3,
                                      max_features=8,
                                      min_samples_leaf=420,
                                      min_samples_split=675,
                                      subsample=0.43918481213848615,
                                      #ccp_alpha =  1.1243946224171244,
                                      random_state=42)

gb_feats = ['AHTGC5PHG', 'AATSOT5PG', 'AHTSOT5PHG', 'HA_AHTGS5PG_diff', 'AHT_GS_P5PG_ratio',
            'AwayCapacityDiff_bin', 'cluster_1', 'cluster_3', 'HT_PrevSeasonPos_inv',
            'AT_PrevSeasonPos_inv', 'bxcx_AHTGC5PG', 'bxcx_AATSOT5PG', 'bxcx_AHT_GS_P5PG_ratio']

# assumes away_gb_setup() returns four values like the other setup helpers
gb_pipe, _, _, gb_name = away_gb_setup()

gb_iso_cal = IMBPipeline(
    steps=[
        ('transformer', KeepColumnsTransformer(gb_feats)),
        ('scaler', StandardScaler(with_mean=False, with_std=True)),
        ('sampling', ADASYN(n_neighbors=6, random_state=1, n_jobs=-1)),
        ('model', CalibratedClassifierCV(gb_model, cv=cv, method='isotonic'))
    ]
)


gb_sig_cal = IMBPipeline(
    steps=[
        ('transformer', KeepColumnsTransformer(gb_feats)),
        ('scaler', StandardScaler(with_mean=False, with_std=True)),
        ('sampling', ADASYN(n_neighbors=6, random_state=1, n_jobs=-1)),
        ('model', CalibratedClassifierCV(gb_model, cv=cv, method='sigmoid'))
    ]
)

pipes = [(gb_pipe, gb_name), (gb_iso_cal, gb_name + ' isotonic'),
         (gb_sig_cal, gb_name + ' sigmoid')]
# plot calibration curve and histogram of predicted positive probabilities
plt.figure(figsize=(10, 10))
ax1 = plt.subplot2grid((3, 1), (0, 0), rowspan=2)
ax2 = plt.subplot2grid((3, 1), (2, 0))
ax1.plot([0, 1], [0, 1], linestyle='--',
         label='Perfect Calibration', color='gray')
for clf, name in pipes:
    print(f'Processing {name}')
    clf.fit(train_features, np.array(away_train_target).reshape(-1,))
    y_pred = clf.predict(test_features)
    fb_score = fbeta(away_test_target, y_pred)
    probs = clf.predict_proba(test_features)[:, 1]
    fov, mpv = calibration_curve(
        np.array(away_test_target).reshape(-1,), probs, n_bins=10)
    ax1.plot(mpv, fov, marker='.', label=f'{name} fb:{fb_score}')
    ax2.hist(probs, range=(0, 1), bins=10, label=name, histtype='step', lw=2)
ax1.set_title('Calibration (Reliability) Curve of Away GB Model')
ax1.set_ylabel('Fraction of Positives')
ax1.set_ylim([-0.05, 1.05])
ax1.legend(loc='lower right')
ax1.grid(alpha=0.3)

ax2.set_xlabel('Mean Predicted Value')
ax2.set_ylabel('Count')
ax2.legend(loc='upper right', ncol=2)
plt.tight_layout()
plt.show()

In [None]:
# away model, voting classifier with calibrated models
train_features = load_train_features()
# the setup helpers return four values (pipe, calibrated pipe, test features, name)
xgb_pipe, _, _, xgb_name = away_xgb_setup()
lgbm_pipe, _, _, lgbm_name = away_lgbm_setup()
away_train_target, _ = load_away_targets()

vc = VotingClassifier(
    estimators=[('xgb', xgb_pipe), ('lgbm', lgbm_pipe)],
    voting='soft',
    n_jobs=1
)
scores = evaluate_model(train_features, np.array(
    away_train_target).reshape(-1,), vc, k_splits=10)
print(f'Away {xgb_name}/{lgbm_name} models Voting Classifier, Avg Fbeta: {np.mean(scores)}, Std: {np.std(scores)}')

## - Summary -

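Because the bets are driven by the probabilities the models predict, each model should be as well calibrated as possible, so that its probabilities are neither overconfident nor underconfident. The calibration variant (none, isotonic or sigmoid) selected for each model, with its test-set Fbeta score, is listed below.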

Home model calibration results:

    Logistic Regression, no calibration. Fbeta: 0.5956956435585508

    XGB classifier, method = sigmoid. Fbeta: 0.5851352642401303

Away model calibration results:

    LGBM model, no calibration. Fbeta: 0.6753807784651213

    XGB model, no calibration. Fbeta: 0.6770685167578759

    GB model: overfitting!
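
To make the idea of a calibration (reliability) curve concrete, here is a minimal plain-Python sketch of the binning behind it, together with the Brier score as a single-number summary. It is only an approximation of what `calibration_curve` does with uniform bins, not the library's implementation. The helper names `reliability_bins` and `brier_score` and the toy data are assumptions made for this example.

```python
def reliability_bins(y_true, y_prob, n_bins=10):
    """Group predictions into equal-width probability bins (hypothetical helper).

    Returns (fraction_of_positives, mean_predicted_value) for non-empty bins.
    A perfectly calibrated model has the two values equal in every bin.
    """
    prob_sums = [0.0] * n_bins
    positives = [0] * n_bins
    counts = [0] * n_bins
    for label, prob in zip(y_true, y_prob):
        # a probability of exactly 1.0 is placed in the last bin
        idx = min(int(prob * n_bins), n_bins - 1)
        prob_sums[idx] += prob
        positives[idx] += label
        counts[idx] += 1
    frac_pos = [positives[b] / counts[b] for b in range(n_bins) if counts[b]]
    mean_pred = [prob_sums[b] / counts[b] for b in range(n_bins) if counts[b]]
    return frac_pos, mean_pred


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# toy data, two bins for readability
y_true = [0, 0, 1, 0, 1, 1, 1, 1]
y_prob = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
frac_pos, mean_pred = reliability_bins(y_true, y_prob, n_bins=2)
brier = brier_score(y_true, y_prob)
print(frac_pos, mean_pred, brier)
```

In this toy data the upper bin has a mean predicted value of about 0.75 while every match in it was a win, which is the underconfident pattern that sits above the diagonal on the reliability plots.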

# -- Final Models (Generalisation/Visualisation) --

## Home Models

In [None]:
# load features/targets
train_features = load_train_features()
test_features = load_test_features()
home_train_target, home_test_target = load_home_targets()

# get predictions for cluster 1 & 3 features
test_cluster_feats = load_cluster_test_feats()
rf_cl1_model = joblib.load('cl1_rf_model.sav')
rf_pred = rf_cl1_model.predict(test_cluster_feats)
gb_cl3_model = joblib.load('cl3_gb_model.sav')
gb_pred = gb_cl3_model.predict(test_cluster_feats)
# add cluster 1 & 3 predictions to test features
test_features['cluster_1'] = rf_pred
test_features['cluster_3'] = gb_pred

# load pipelines
lr_pipe, cal_lr_pipe, lr_test_feats, lr_name = home_lr_setup()
xgb_pipe, cal_xgb_pipe, xgb_test_feats, xgb_name = home_xgb_setup()

# instantiate a simple model
lr_sm = LogisticRegression(class_weight='balanced', random_state=42)
lrsm_test_feats = test_features['AwayCapacityDiff']
lr_sm_pipe = IMBPipeline(
    steps=[
        ('trans', KeepColumnsTransformer(['AwayCapacityDiff'])),
        ('model', lr_sm)
    ]
)
lr_sm_name = 'LR SM'

# instantiate voting classifier
vc_pipe = VotingClassifier(
    estimators=[('lr', lr_pipe), ('xgb', cal_xgb_pipe)],
    voting='soft',
    n_jobs=-1
)
vc_name = 'VC'

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=1)

pipes = [lr_pipe, cal_xgb_pipe, lr_sm_pipe, vc_pipe]
p_names = [lr_name, xgb_name, lr_sm_name, vc_name]
fit_pipes = [pipe.fit(train_features, np.array(
    home_train_target).reshape(-1,)) for pipe in pipes]
# pipelines for ensemble
ensemble_pipes = [lr_pipe, cal_xgb_pipe]
ensemble_fit_pipes = [pipe.fit(train_features, np.array(
    home_train_target).reshape(-1,)) for pipe in ensemble_pipes]
# save models
for i in range(len(fit_pipes)):
    joblib.dump(fit_pipes[i], f'{p_names[i]}_home_win_model_pipeline.sav')
pipe_test_features = [lr_test_feats, xgb_test_feats, lrsm_test_feats]
# instantiate lists for plotting
lc_mean_train_scores = []
lc_std_train_scores = []
lc_mean_test_scores = []
lc_std_test_scores = []
train_sizes = []
rc_rates = []
rc_threshs = []
auc_scores = []

for i in range(len(pipes)):
    pipe_pred = fit_pipes[i].predict(test_features)
    # predicted probabilities for class 1 (home win)
    pipe_prob_pred = fit_pipes[i].predict_proba(test_features)[:, 1]
    fb_score = fbeta(home_test_target, pipe_pred)
    print(f'{p_names[i]} pipe Fbeta score: {fb_score}')
    print(
        f'{p_names[i]} pipe Cohen Kappa Score: {cohen_kappa_score(home_test_target, pipe_pred)}')
    print(
        f'{p_names[i]} pipe classification report:\n{classification_report(home_test_target, pipe_pred)}')
    print(
        f'{p_names[i]} pipe confusion matrix:\n{confusion_matrix(home_test_target, pipe_pred)}\n')
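    # learning curve (no scoring argument is passed, so the scores are the
    # estimator's default metric, which is accuracy for these classifiers)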
    train_size, train_scores, test_scores = learning_curve(pipes[i], train_features, np.array(
        home_train_target).reshape(-1,), cv=cv, n_jobs=-1, random_state=1)
    mean_train_scores = np.mean(train_scores, axis=1)
    std_train_scores = np.std(train_scores, axis=1)
    mean_test_scores = np.mean(test_scores, axis=1)
    std_test_scores = np.std(test_scores, axis=1)
    lc_mean_train_scores.append(mean_train_scores)
    lc_std_train_scores.append(std_train_scores)
    lc_mean_test_scores.append(mean_test_scores)
    lc_std_test_scores.append(std_test_scores)
    train_sizes.append(train_size)
    # roc curve
    fpr, tpr, thresholds = roc_curve(home_test_target, pipe_prob_pred)
    rc_rates.append((fpr, tpr))
    rc_threshs.append(thresholds)
    auc_scores.append(auc(fpr, tpr))
    # save test features win probabilities for home teams as df
    home_team_win_probs = pd.DataFrame(
        {'HomeTeam': test_features.HomeTeam.values, f'{p_names[i]}_home_win_probs': pipe_prob_pred})
    home_team_win_probs.to_csv(
        f'{p_names[i]}_model_home_team_win_probs.csv', encoding='utf-8', index=False)
    # feature importances
    if p_names[i] == 'XGB':
        xgb_pipe.fit(train_features, np.array(home_train_target).reshape(-1,))
        importances = xgb_pipe.steps[3][1].feature_importances_
        feats = pipe_test_features[i].columns
        plt.figure(figsize=(20, 16))
        plt.bar(feats, importances)
        plt.title(f'{p_names[i]} Feature Importances')
        plt.xlabel(f'{p_names[i]} Features')
        plt.ylabel('Importance')
        plt.grid(alpha=0.3)
        plt.show()

# evaluate an equal weight ensemble
equal_weights = [1.0/len(ensemble_pipes) for _ in range(len(ensemble_pipes))]
np.savetxt('weights_for_equal_weighted_ensemble.csv',
           np.array(equal_weights), delimiter=',')  # save weights
ew_pred, ew_probs, ew_score = evaluate_ensemble(
    ensemble_fit_pipes, equal_weights, test_features, home_test_target)
# roc curve
fpr, tpr, thresholds = roc_curve(home_test_target, ew_probs)
rc_rates.append((fpr, tpr))
rc_threshs.append(thresholds)
auc_scores.append(auc(fpr, tpr))
p_names.append('EW Ens')
print(f'Equal weighted ensemble Fbeta score: {ew_score}')
print(
    f'Equal weighted ensemble Cohen Kappa Score: {cohen_kappa_score(home_test_target, ew_pred)}')
print(
    f'Equal weighted ensemble classification report:\n{classification_report(home_test_target, ew_pred)}')
print(
    f'Equal weighted ensemble confusion matrix:\n{confusion_matrix(home_test_target, ew_pred)}\n')
# save test features win probabilities for home teams as df
home_team_win_probs = pd.DataFrame(
    {'HomeTeam': test_features.HomeTeam.values, 'EW_Ensemble_home_win_probs': ew_probs})
home_team_win_probs.to_csv(
    'EW_Ensemble_home_team_win_probs.csv', encoding='utf-8', index=False)

# weighted average ensemble
# bounds for weights
weight_bounds = [(0.0, 0.1) for _ in range(len(ensemble_pipes))]
# arguments for the loss function
search_arg = (ensemble_fit_pipes, test_features, home_test_target)
# global optimization of ensemble weights
result = differential_evolution(
    loss_function, weight_bounds, search_arg, maxiter=1000, tol=1e-7)
weights = normalise(result['x'])
np.savetxt('weights_for_optimised_weighted_ensemble.csv',
           np.array(weights), delimiter=',')  # save weights
print(f'Optimised Ensemble Weights:\n{weights}')
# evaluate ensemble with optimised weights
ow_pred, ow_probs, ow_score = evaluate_ensemble(
    ensemble_fit_pipes, weights, test_features, home_test_target)
fpr, tpr, thresholds = roc_curve(home_test_target, ow_probs)
rc_rates.append((fpr, tpr))
rc_threshs.append(thresholds)
auc_scores.append(auc(fpr, tpr))
p_names.append('OW Ens')
print(f'Optimised Weighted Average Ensemble Fbeta Score: {ow_score}')
print(
    f'Optimised Weighted Average Ensemble Cohen Kappa Score: {cohen_kappa_score(home_test_target, ow_pred)}')
print(
    f'Optimised Weighted Average Ensemble classification report:\n{classification_report(home_test_target, ow_pred)}')
print(
    f'Optimised Weighted Average Ensemble confusion matrix:\n{confusion_matrix(home_test_target, ow_pred)}\n')
# save test features win probabilities for home teams as df
home_team_win_probs = pd.DataFrame(
    {'HomeTeam': test_features.HomeTeam.values, 'OW_Ensemble_home_win_probs': ow_probs})
home_team_win_probs.to_csv(
    'OW_Ensemble_home_team_win_probs.csv', encoding='utf-8', index=False)

# plots for learning curves
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle('Home Win Model Learning Curves')
fig.tight_layout()
# one subplot per pipeline
for i, ax in zip(range(len(pipes)), (ax1, ax2, ax3, ax4)):
    ax.plot(train_sizes[i], lc_mean_train_scores[i],
            'o-', color='blue', label='train score')
    ax.fill_between(train_sizes[i],
                    lc_mean_train_scores[i] - lc_std_train_scores[i],
                    lc_mean_train_scores[i] + lc_std_train_scores[i],
                    alpha=0.1,
                    color='b')
    ax.plot(train_sizes[i], lc_mean_test_scores[i],
            'o-', color='r', label='cv score')
    ax.fill_between(train_sizes[i],
                    lc_mean_test_scores[i] - lc_std_test_scores[i],
                    lc_mean_test_scores[i] + lc_std_test_scores[i],
                    alpha=0.1,
                    color='r')
    ax.set_title(f'{p_names[i]} Learning Curve')
    ax.set_xlabel('Training Samples')
    ax.set_ylabel('Fbeta Score')
    ax.legend(loc='best')
    ax.grid(alpha=0.3)
plt.show()

# plots for roc curves
plt.figure(figsize=(20, 16))
# one ROC curve per model/ensemble collected above
for (fp_rate, tp_rate), name, score in zip(rc_rates, p_names, auc_scores):
    plt.plot(fp_rate, tp_rate, label=f'{name} (AUC: {score})')
plt.plot([0, 1], [0, 1], color='blue', linestyle='--', label='Random')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.01])
plt.title('Home Win Model ROC Curves')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc='lower right')
plt.grid(alpha=0.3)
plt.show()

# plots for cumulative response curve
plt.figure(figsize=(20, 16))
# x axis: position of each ROC threshold as a fraction of all thresholds
for (_, tp_rate), threshs, name in zip(rc_rates, rc_threshs, p_names):
    pct = [(idx + 1) / len(threshs) for idx in range(len(threshs))]
    plt.plot(pct, tp_rate, label=name)
plt.plot([0, 1], [0, 1], color='blue', linestyle='--', label='Random')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.01])
plt.title('Home Win Model Cumulative Response Curves')
plt.xlabel('Percentage of Data')
plt.ylabel('True Positive Rate')
plt.legend(loc='lower right')
plt.grid(alpha=0.3)
plt.show()

# TODO: possibly add lift curves / profit curves
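
The cumulative response plot above uses the position of each ROC threshold as its x axis. Since `roc_curve` returns one point per retained threshold (and drops some intermediate thresholds by default), that axis is a proxy for the percentage of data rather than the exact value. As a hedged alternative, the sketch below computes the curve directly from ranked predictions in plain Python. The function name `cumulative_gains` and the example values are assumptions for illustration only.

```python
def cumulative_gains(y_true, y_prob):
    """Fraction of positives captured vs fraction of data targeted.

    Matches are ranked from highest to lowest predicted win probability;
    the curve starts at (0, 0) and ends at (1, 1). Hypothetical helper.
    """
    total_pos = sum(y_true)
    if total_pos == 0:
        raise ValueError('y_true contains no positive samples')
    # indices of the samples ordered by descending predicted probability
    order = sorted(range(len(y_prob)), key=lambda i: y_prob[i], reverse=True)
    frac_data, frac_pos = [0.0], [0.0]
    captured = 0
    for rank, idx in enumerate(order, start=1):
        captured += y_true[idx]
        frac_data.append(rank / len(order))
        frac_pos.append(captured / total_pos)
    return frac_data, frac_pos


# illustrative values only
y_true = [1, 0, 1, 0, 0]
y_prob = [0.9, 0.8, 0.7, 0.2, 0.1]
frac_data, frac_pos = cumulative_gains(y_true, y_prob)
print(list(zip(frac_data, frac_pos)))
```

Plotting `frac_pos` against `frac_data` gives a curve that can be read as "betting on the top x% of matches by predicted probability captures y% of the actual wins".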

In [None]:
# optimised weight ensemble including simple logistic regression model
# load features/targets
train_features = load_train_features()
test_features = load_test_features()
home_train_target, home_test_target = load_home_targets()

# get predictions for cluster 1 & 3 features
test_cluster_feats = load_cluster_test_feats()
rf_cl1_model = joblib.load('cl1_rf_model.sav')
rf_pred = rf_cl1_model.predict(test_cluster_feats)
gb_cl3_model = joblib.load('cl3_gb_model.sav')
gb_pred = gb_cl3_model.predict(test_cluster_feats)
# add cluster 1 & 3 predictions to test features
test_features['cluster_1'] = rf_pred
test_features['cluster_3'] = gb_pred

# load pipelines
lr_pipe, cal_lr_pipe, lr_test_feats, lr_name = home_lr_setup()
xgb_pipe, cal_xgb_pipe, xgb_test_feats, xgb_name = home_xgb_setup()

# instantiate a simple model
lr_sm = LogisticRegression(class_weight='balanced', random_state=42)
lrsm_test_feats = test_features['AwayCapacityDiff']
lr_sm_pipe = IMBPipeline(
    steps=[
        ('trans', KeepColumnsTransformer(['AwayCapacityDiff'])),
        ('model', lr_sm)
    ]
)

ensemble_pipes = [lr_pipe, cal_xgb_pipe, lr_sm_pipe]
ensemble_fit_pipes = [pipe.fit(train_features, np.array(
    home_train_target).reshape(-1,)) for pipe in ensemble_pipes]

# bounds for weights
weight_bounds = [(0.0, 0.1) for _ in range(len(ensemble_pipes))]
# arguments for the loss function
search_arg = (ensemble_fit_pipes, test_features, home_test_target)
# global optimization of ensemble weights
result = differential_evolution(
    loss_function, weight_bounds, search_arg, maxiter=1000, tol=1e-7)
weights = normalise(result['x'])
np.savetxt('home_ensemble_optimised_weights_inc_simpmod.csv',
           np.array(weights), delimiter=',')  # save weights
print(f'Optimised Ensemble inc. Simple Model Weights:\n{weights}')
# evaluate ensemble with optimised weights
ow_pred, ow_probs, ow_score = evaluate_ensemble(
    ensemble_fit_pipes, weights, test_features, home_test_target)
fpr, tpr, thresholds = roc_curve(home_test_target, ow_probs)
rc_rates.append((fpr, tpr))
rc_threshs.append(thresholds)
auc_scores.append(auc(fpr, tpr))
p_names.append('OW Ens')
print(f'Optimised Weighted Average Ensemble Fbeta Score: {ow_score}')
print(
    f'Optimised Weighted Average Ensemble Cohen Kappa Score: {cohen_kappa_score(home_test_target, ow_pred)}')
print(
    f'Optimised Weighted Average Ensemble classification report:\n{classification_report(home_test_target, ow_pred)}')
print(
    f'Optimised Weighted Average Ensemble confusion matrix:\n{confusion_matrix(home_test_target, ow_pred)}\n')
# save test features win probabilities for home teams as df
home_team_win_probs = pd.DataFrame(
    {'HomeTeam': test_features.HomeTeam.values, 'OW_Ensemble_home_win_probs': ow_probs})
home_team_win_probs.to_csv(
    'OW_Ensemble_inc_simpmod_home_team_win_probs.csv', encoding='utf-8', index=False)
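
The optimised-weight ensembles above rely on the helpers `normalise`, `loss_function` and `evaluate_ensemble`, which are defined elsewhere in the notebook. The sketch below shows one plausible way such a weight search can be scored, using plain Python. Everything in it is an assumption for illustration: the function names, the use of an L1 normalisation, the 0.5 decision threshold and `beta=1.0` are not taken from the notebook's own helpers.

```python
def normalise_weights(weights):
    """Scale weights so their absolute values sum to 1 (assumed L1 norm)."""
    total = sum(abs(w) for w in weights)
    if total == 0.0:
        return [1.0 / len(weights)] * len(weights)
    return [w / total for w in weights]


def weighted_probs(member_probs, weights):
    """Weighted average of the members' class-1 probabilities."""
    n_samples = len(member_probs[0])
    return [sum(w * probs[i] for w, probs in zip(weights, member_probs))
            for i in range(n_samples)]


def f_beta(y_true, y_pred, beta=1.0):
    """F-beta score for the positive class; beta=1.0 is an assumption here."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def ensemble_loss(raw_weights, member_probs, y_true, beta=1.0):
    """Value an optimiser would minimise: 1 - F-beta of the weighted ensemble."""
    weights = normalise_weights(raw_weights)
    probs = weighted_probs(member_probs, weights)
    preds = [1 if p > 0.5 else 0 for p in probs]
    return 1.0 - f_beta(y_true, preds, beta)


# illustrative probabilities from two hypothetical members
y_true = [1, 0, 1, 0]
model_a = [0.9, 0.6, 0.4, 0.2]
model_b = [0.6, 0.3, 0.7, 0.4]
loss_mixed = ensemble_loss([0.02, 0.06], [model_a, model_b], y_true)
loss_a_only = ensemble_loss([0.1, 0.0], [model_a, model_b], y_true)
print(normalise_weights([0.02, 0.06]), loss_mixed, loss_a_only)
```

Because the raw weights are normalised before use, only their ratios matter, so in this sketch a search over bounds of `(0.0, 0.1)` explores the same set of ensembles as a search over `(0.0, 1.0)`.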

## Away Models

In [None]:
# load features/targets
train_features = load_train_features()
test_features = load_test_features()
away_train_target, away_test_target = load_away_targets()

# get predictions for cluster 1 & 3 features
test_cluster_feats = load_cluster_test_feats()
rf_cl1_model = joblib.load('cl1_rf_model.sav')
rf_pred = rf_cl1_model.predict(test_cluster_feats)
gb_cl3_model = joblib.load('cl3_gb_model.sav')
gb_pred = gb_cl3_model.predict(test_cluster_feats)
# add cluster 1 & 3 predictions to test features
test_features['cluster_1'] = rf_pred
test_features['cluster_3'] = gb_pred

# load pipelines
lgbm_pipe, lgbm_cal_pipe, lgbm_test_feats, lgbm_name = away_lgbm_setup()
xgb_pipe, xgb_cal_pipe, xgb_test_feats, xgb_name = away_xgb_setup()
# instantiate a simple model
lr_sm = LogisticRegression(class_weight='balanced', random_state=42)
lrsm_test_feats = test_features['AwayCapacityDiff']
lr_sm_pipe = IMBPipeline(
    steps=[
        ('trans', KeepColumnsTransformer(['AwayCapacityDiff'])),
        ('model', lr_sm)
    ]
)
lr_sm_name = 'LR SM'
# instantiate voting classifier
vc_pipe = VotingClassifier(
    estimators=[('lgbm', lgbm_pipe), ('xgb', xgb_pipe)],
    voting='soft',
    n_jobs=-1
)
vc_name = 'VC'
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=1)

pipes = [lgbm_pipe, xgb_pipe, vc_pipe, lr_sm_pipe]
p_names = [lgbm_name, xgb_name, vc_name, lr_sm_name]
fit_pipes = [pipe.fit(train_features, np.array(
    away_train_target).reshape(-1,)) for pipe in pipes]
# pipelines for ensemble
ensemble_pipes = [lgbm_pipe, xgb_pipe]
ensemble_fit_pipes = [pipe.fit(train_features, np.array(
    away_train_target).reshape(-1,)) for pipe in ensemble_pipes]
# save models
for i in range(len(fit_pipes)):
    joblib.dump(fit_pipes[i], f'{p_names[i]}_away_win_model_pipeline.sav')
pipe_test_features = [lgbm_test_feats, xgb_test_feats, lrsm_test_feats]
# instantiate lists for plotting
lc_mean_train_scores = []
lc_std_train_scores = []
lc_mean_test_scores = []
lc_std_test_scores = []
train_sizes = []
rc_rates = []
rc_threshs = []
auc_scores = []

for i in range(len(pipes)):
    pipe_pred = fit_pipes[i].predict(test_features)
    # predicted probabilities for class 1 (away win)
    pipe_prob_pred = fit_pipes[i].predict_proba(test_features)[:, 1]
    fb_score = fbeta(away_test_target, pipe_pred)
    print(f'{p_names[i]} pipe Fbeta score: {fb_score}')
    print(
        f'{p_names[i]} pipe Cohen Kappa Score: {cohen_kappa_score(away_test_target, pipe_pred)}')
    print(
        f'{p_names[i]} pipe classification report:\n{classification_report(away_test_target, pipe_pred)}')
    print(
        f'{p_names[i]} pipe confusion matrix:\n{confusion_matrix(away_test_target, pipe_pred)}\n')
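    # learning curve (no scoring argument is passed, so the scores are the
    # estimator's default metric, which is accuracy for these classifiers)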
    train_size, train_scores, test_scores = learning_curve(pipes[i], train_features, np.array(
        away_train_target).reshape(-1,), cv=cv, n_jobs=-1, random_state=1)
    mean_train_scores = np.mean(train_scores, axis=1)
    std_train_scores = np.std(train_scores, axis=1)
    mean_test_scores = np.mean(test_scores, axis=1)
    std_test_scores = np.std(test_scores, axis=1)
    lc_mean_train_scores.append(mean_train_scores)
    lc_std_train_scores.append(std_train_scores)
    lc_mean_test_scores.append(mean_test_scores)
    lc_std_test_scores.append(std_test_scores)
    train_sizes.append(train_size)
    # roc curve
    fpr, tpr, thresholds = roc_curve(away_test_target, pipe_prob_pred)
    rc_rates.append((fpr, tpr))
    rc_threshs.append(thresholds)
    auc_scores.append(auc(fpr, tpr))
    # save test features win probabilities for away teams as df
    away_team_win_probs = pd.DataFrame(
        {'AwayTeam': test_features.AwayTeam.values, f'{p_names[i]}_away_win_probs': pipe_prob_pred})
    away_team_win_probs.to_csv(
        f'{p_names[i]}_model_away_team_win_probs.csv', encoding='utf-8', index=False)
    # feature importances (tree-based models only)
    if p_names[i] in ('XGB', 'LGBM'):
        # the estimator is the fourth step of the fitted pipeline
        importances = fit_pipes[i].steps[3][1].feature_importances_
        feats = pipe_test_features[i].columns
        plt.figure(figsize=(20, 16))
        plt.bar(feats, importances)
        plt.title(f'{p_names[i]} Away Model Feature Importances')
        plt.xlabel(f'{p_names[i]} Features')
        plt.xticks(rotation=45)
        plt.ylabel('Importance')
        plt.grid(alpha=0.3)
        plt.show()

# evaluate an equal weight ensemble
equal_weights = [1.0/len(ensemble_pipes) for _ in range(len(ensemble_pipes))]
np.savetxt('equal_weights_for_away_models_ensemble.csv',
           np.array(equal_weights), delimiter=',')  # save weights
ew_pred, ew_probs, ew_score = evaluate_ensemble(
    ensemble_fit_pipes, equal_weights, test_features, away_test_target)
# roc curve
fpr, tpr, thresholds = roc_curve(away_test_target, ew_probs)
rc_rates.append((fpr, tpr))
rc_threshs.append(thresholds)
auc_scores.append(auc(fpr, tpr))
p_names.append('EW Ens')
print(f'Equal weighted ensemble Fbeta score: {ew_score}')
print(
    f'Equal weighted ensemble Cohen Kappa Score: {cohen_kappa_score(away_test_target, ew_pred)}')
print(
    f'Equal weighted ensemble classification report:\n{classification_report(away_test_target, ew_pred)}')
print(
    f'Equal weighted ensemble confusion matrix:\n{confusion_matrix(away_test_target, ew_pred)}\n')
# save test features win probabilities for away teams as df
away_team_win_probs = pd.DataFrame(
    {'AwayTeam': test_features.AwayTeam.values, 'EW_Ensemble_away_win_probs': ew_probs})
away_team_win_probs.to_csv(
    'EW_Ensemble_away_team_win_probs.csv', encoding='utf-8', index=False)

# weighted average ensemble
# bounds for weights
weight_bounds = [(0.0, 0.1) for _ in range(len(ensemble_pipes))]
# arguments for the loss function
search_arg = (ensemble_fit_pipes, test_features, away_test_target)
# global optimization of ensemble weights
result = differential_evolution(
    loss_function, weight_bounds, search_arg, maxiter=1000, tol=1e-7)
weights = normalise(result['x'])
np.savetxt('optimised_weights_for_away_models_ensemble.csv',
           np.array(weights), delimiter=',')  # save weights
print(f'Optimised Ensemble Weights:\n{weights}')
# evaluate ensemble with optimised weights
ow_pred, ow_probs, ow_score = evaluate_ensemble(
    ensemble_fit_pipes, weights, test_features, away_test_target)
fpr, tpr, thresholds = roc_curve(away_test_target, ow_probs)
rc_rates.append((fpr, tpr))
rc_threshs.append(thresholds)
auc_scores.append(auc(fpr, tpr))
p_names.append('OW Ens')
print(f'Optimised Weighted Average Ensemble Fbeta Score: {ow_score}')
print(
    f'Optimised Weighted Average Ensemble Cohen Kappa Score: {cohen_kappa_score(away_test_target, ow_pred)}')
print(
    f'Optimised Weighted Average Ensemble classification report:\n{classification_report(away_test_target, ow_pred)}')
print(
    f'Optimised Weighted Average Ensemble confusion matrix:\n{confusion_matrix(away_test_target, ow_pred)}\n')
# save test features win probabilities for away teams as df
away_team_win_probs = pd.DataFrame(
    {'AwayTeam': test_features.AwayTeam.values, 'OW_Ensemble_away_win_probs': ow_probs})
away_team_win_probs.to_csv(
    'OW_Ensemble_away_team_win_probs.csv', encoding='utf-8', index=False)

# plots for learning curves
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle('Away Win Model Learning Curves')
fig.tight_layout(pad=3)
lc_axes = [ax1, ax2, ax3, ax4]
for i in range(len(pipes)):
    # any pipe beyond the fourth falls back to the last panel
    ax = lc_axes[min(i, 3)]
    ax.plot(train_sizes[i], lc_mean_train_scores[i],
            'o-', color='blue', label='train score')
    ax.fill_between(train_sizes[i],
                    lc_mean_train_scores[i] - lc_std_train_scores[i],
                    lc_mean_train_scores[i] + lc_std_train_scores[i],
                    alpha=0.1,
                    color='b')
    ax.plot(train_sizes[i], lc_mean_test_scores[i],
            'o-', color='r', label='cv score')
    ax.fill_between(train_sizes[i],
                    lc_mean_test_scores[i] - lc_std_test_scores[i],
                    lc_mean_test_scores[i] + lc_std_test_scores[i],
                    alpha=0.1,
                    color='r')
    ax.set_title(f'{p_names[i]} Learning Curve')
    ax.set_xlabel('Training Samples')
    ax.set_ylabel('Fbeta Score')
    ax.legend(loc='best')
    ax.grid(alpha=0.3)
plt.show()

# plots for roc curves
plt.figure(figsize=(20, 16))
# one curve per model: the four pipelines plus the two ensembles
for j in range(6):
    plt.plot(rc_rates[j][0], rc_rates[j][1],
             label=f'{p_names[j]} (AUC: {auc_scores[j]})')
plt.plot([0, 1], [0, 1], color='blue', linestyle='--', label='Random')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.01])
plt.title('Away Win Model ROC Curves')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc='lower right')
plt.grid(alpha=0.3)
plt.show()

# plots for cumulative response curve
plt.figure(figsize=(20, 16))
# one curve per model: the four pipelines plus the two ensembles
for j in range(6):
    n_thresh = len(rc_threshs[j])
    # x axis: position of each ROC threshold as a fraction of all thresholds
    pct_seen = np.arange(1, n_thresh + 1) / n_thresh
    plt.plot(pct_seen, rc_rates[j][1], label=f'{p_names[j]}')
plt.plot([0, 1], [0, 1], color='blue', linestyle='--', label='Random')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.01])
plt.title('Away Win Model Cumulative Response Curves')
plt.xlabel('Percentage of Data')
plt.ylabel('True Positive Rate')
plt.legend(loc='lower right')
plt.grid(alpha=0.3)
plt.show()

In [None]:
# optimised weight ensemble including the simple logistic regression model
# load features/targets
train_features = load_train_features()
test_features = load_test_features()
away_train_target, away_test_target = load_away_targets()

# get predictions for cluster 1 & 3 features
test_cluster_feats = load_cluster_test_feats()
rf_cl1_model = joblib.load('cl1_rf_model.sav')
rf_pred = rf_cl1_model.predict(test_cluster_feats)
gb_cl3_model = joblib.load('cl3_gb_model.sav')
gb_pred = gb_cl3_model.predict(test_cluster_feats)
# add cluster 1 & 3 predictions to test features
test_features['cluster_1'] = rf_pred
test_features['cluster_3'] = gb_pred

# load pipelines
lgbm_pipe, lgbm_cal_pipe, lgbm_test_feats, lgbm_name = away_lgbm_setup()
xgb_pipe, xgb_cal_pipe, xgb_test_feats, xgb_name = away_xgb_setup()
# instantiate a simple model
lr_sm = LogisticRegression(class_weight='balanced', random_state=42)
lrsm_test_feats = test_features['AwayCapacityDiff']
lr_sm_pipe = IMBPipeline(
    steps=[
        ('trans', KeepColumnsTransformer(['AwayCapacityDiff'])),
        ('model', lr_sm)
    ]
)

ensemble_pipes = [lgbm_pipe, xgb_pipe, lr_sm_pipe]
ensemble_fit_pipes = [pipe.fit(train_features, np.array(
    away_train_target).reshape(-1,)) for pipe in ensemble_pipes]

weight_bounds = [(0.0, 0.1) for _ in range(len(ensemble_pipes))]
# arguments for the loss function
search_arg = (ensemble_fit_pipes, test_features, away_test_target)
# global optimization of ensemble weights
result = differential_evolution(
    loss_function, weight_bounds, search_arg, maxiter=1000, tol=1e-7)
weights = normalise(result['x'])
np.savetxt('optimised_weights_for_away_models_inc_simpmod_ensemble.csv',
           np.array(weights), delimiter=',')  # save weights
print(f'Optimised Ensemble inc Simple Model Weights:\n{weights}')
# evaluate ensemble with optimised weights
ow_pred, ow_probs, ow_score = evaluate_ensemble(
    ensemble_fit_pipes, weights, test_features, away_test_target)
fpr, tpr, thresholds = roc_curve(away_test_target, ow_probs)
rc_rates.append((fpr, tpr))
rc_threshs.append(thresholds)
auc_scores.append(auc(fpr, tpr))
# distinct label so it does not clash with the earlier 'OW Ens' entry
p_names.append('OW Ens inc SM')
print(
    f'Optimised Weighted Average Ensemble inc. Simple Model Fbeta Score: {ow_score}')
print(
    f'Optimised Weighted Average Ensemble inc. Simple Model Cohen Kappa Score: {cohen_kappa_score(away_test_target, ow_pred)}')
print(
    f'Optimised Weighted Average Ensemble inc. Simple Model classification report:\n{classification_report(away_test_target, ow_pred)}')
print(
    f'Optimised Weighted Average Ensemble inc. Simple Model confusion matrix:\n{confusion_matrix(away_test_target, ow_pred)}\n')
# save test features win probabilities for away teams as df
away_team_win_probs = pd.DataFrame(
    {'AwayTeam': test_features.AwayTeam.values, 'OW_Ensemble_away_win_probs': ow_probs})
away_team_win_probs.to_csv(
    'OW_Ensemble_inc_simpmod_away_team_win_probs.csv', encoding='utf-8', index=False)

## Final model expected profit

In [None]:
# load data
test_features = load_test_features()
_, home_test_target = load_home_targets()
_, away_test_target = load_away_targets()

# get predictions for cluster 1 & 3 features
test_cluster_feats = load_cluster_test_feats()
rf_cl1_model = joblib.load('cl1_rf_model.sav')
rf_pred = rf_cl1_model.predict(test_cluster_feats)
gb_cl3_model = joblib.load('cl3_gb_model.sav')
gb_pred = gb_cl3_model.predict(test_cluster_feats)
# add cluster 1 & 3 predictions to test features
test_features['cluster_1'] = rf_pred
test_features['cluster_3'] = gb_pred

# load home models
home_lr_model = joblib.load('LR_home_win_model_pipeline.sav')
home_xgb_model = joblib.load('XGB_home_win_model_pipeline.sav')
home_lrsm_model = joblib.load('LR SM_home_win_model_pipeline.sav')
home_opt_weights = np.loadtxt(
    'home_ensemble_optimised_weights_inc_simpmod.csv', delimiter=',')

# load away models
away_lgbm_model = joblib.load('LGBM_away_win_model_pipeline.sav')
away_xgb_model = joblib.load('XGB_away_win_model_pipeline.sav')
away_lrsm_model = joblib.load('LR SM_away_win_model_pipeline.sav')
away_opt_weights = np.loadtxt(
    'optimised_weights_for_away_models_inc_simpmod_ensemble.csv', delimiter=',')

# the ensemble entries have no single fitted model, so None is a placeholder
home_models = [(home_lr_model, 'home_LR'), (home_xgb_model, 'home_XGB'),
               (home_lrsm_model, 'home_LRSM'), (None, 'home_ENS')]
away_models = [(away_lgbm_model, 'away_LGBM'), (away_xgb_model, 'away_XGB'),
               (away_lrsm_model, 'away_LRSM'), (None, 'away_ENS')]

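# cost-benefit matrix in the same layout as confusion_matrix,
# [[TN, FP], [FN, TP]]: a false positive loses the 5 staked and
# a true positive returns 16 (an averaged return)
cb_matrix = np.array([[0, -5],
                      [0, 16]])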
home_pos_prior = 0.44
away_pos_prior = 0.29
models_dict = {'Models': [], 'Combined Exp. Profit': []}
for h_model, h_name in home_models:
    if h_name == 'home_ENS':
        home_ensemble = [home_lr_model, home_xgb_model, home_lrsm_model]
        h_ens_yhat_probs = [home_ensemble[i].predict_proba(
            test_features)[:, 1] for i in range(len(home_ensemble))]
        h_ens_yhat_probs = np.array(h_ens_yhat_probs)
        h_weighted_probs = np.tensordot(
            h_ens_yhat_probs, home_opt_weights, axes=((0), (0)))
        # round probabilities to classes (0 or 1)
        h_yhat_classes = np.round(h_weighted_probs).astype(int)
        home_cm = confusion_matrix(home_test_target, h_yhat_classes)
        home_exp_profit = expected_profit(home_cm, cb_matrix, home_pos_prior)
    else:
        home_yhat_class = h_model.predict(test_features)
        home_cm = confusion_matrix(home_test_target, home_yhat_class)
        home_exp_profit = expected_profit(home_cm, cb_matrix, home_pos_prior)
    for a_model, a_name in away_models:
        print(f'processing {h_name} & {a_name}')
        if a_name == 'away_ENS':
            away_ensemble = [away_lgbm_model, away_xgb_model, away_lrsm_model]
            a_ens_yhat_probs = [away_ensemble[i].predict_proba(
                test_features)[:, 1] for i in range(len(away_ensemble))]
            a_ens_yhat_probs = np.array(a_ens_yhat_probs)
            a_weighted_probs = np.tensordot(
                a_ens_yhat_probs, away_opt_weights, axes=((0), (0)))
            # round probabilities to classes (0 or 1)
            a_yhat_classes = np.round(a_weighted_probs).astype(int)
            away_cm = confusion_matrix(away_test_target, a_yhat_classes)
            away_exp_profit = expected_profit(
                away_cm, cb_matrix, away_pos_prior)
        else:
            # class predictions from the individual away model
            away_yhat_class = a_model.predict(test_features)
            away_cm = confusion_matrix(away_test_target, away_yhat_class)
            away_exp_profit = expected_profit(
                away_cm, cb_matrix, away_pos_prior)

        comb_exp_profit = home_exp_profit + away_exp_profit
        models_dict['Models'].append(f'{h_name} & {a_name}')
        models_dict['Combined Exp. Profit'].append(comb_exp_profit)
        print(
            f'{h_name} exp. profit: {home_exp_profit},      {a_name} exp. profit: {away_exp_profit}')

model_profit_df = pd.DataFrame(models_dict)
model_profit_df = model_profit_df.sort_values(
    by='Combined Exp. Profit', ascending=False).reset_index(drop=True)
model_profit_df

## - Summary -

Home Models:

Both home models generalised well. The Logistic Regression model came out on top with an fbeta score of 0.595695 and a Cohen kappa score of 0.181366, compared to XGBoost's fbeta score of 0.585135 and Cohen kappa score of 0.156890, with cluster_3 being the feature that provided the most information. Both models outperformed the simple logistic regression model.

The voting classifier did not outperform the logistic regression model.

An equal weighted ensemble and an optimised weighted ensemble were also produced to see if any more performance could be gained. The equal weighted ensemble performed exactly the same as the voting classifier, which would be expected, but the optimised weighted ensemble gave a slight performance boost over the logistic regression model with an fbeta score of 0.596013, although with a slightly lower Cohen kappa score of 0.180733.
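
As a minimal sketch of how a weighted-average ensemble combines class 1 probabilities, consider the snippet below. The helpers `normalise_weights` and `weighted_ensemble_probs` are hypothetical stand-ins written for illustration, not the notebook's own `normalise` and `evaluate_ensemble`, and the probabilities are made-up numbers.

```python
import numpy as np


def normalise_weights(weights):
    """Scale non-negative weights so they sum to 1 (hypothetical helper)."""
    weights = np.asarray(weights, dtype=float)
    total = weights.sum()
    if total == 0.0:
        # fall back to equal weights when every weight is zero
        return np.full(weights.shape, 1.0 / weights.size)
    return weights / total


def weighted_ensemble_probs(member_probs, weights):
    """Weighted average of class 1 probabilities, one row per member model."""
    member_probs = np.asarray(member_probs, dtype=float)
    return np.tensordot(normalise_weights(weights), member_probs, axes=(0, 0))


# made-up class 1 probabilities from three models for four matches
member_probs = [
    [0.60, 0.40, 0.55, 0.30],
    [0.70, 0.35, 0.45, 0.20],
    [0.50, 0.45, 0.65, 0.40],
]

# equal weights reproduce a plain average of the member probabilities
equal = weighted_ensemble_probs(member_probs, [1, 1, 1])
# unequal weights tilt the average towards the first model
tilted = weighted_ensemble_probs(member_probs, [0.08, 0.02, 0.0])
# threshold the averaged probabilities at 0.5 to get class labels
equal_classes = (equal >= 0.5).astype(int)
```

If the notebook's `normalise` rescales weights in the same way, only the ratios between weights matter, so raw weights of 0.08 and 0.02 describe the same ensemble as 0.8 and 0.2.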

The learning curves for the logistic regression and XGBoost models show that there is room for improvement by obtaining more data. The logistic regression model shows more variability in its cross validated scores.

All models tested show ROC curves better than random and better than the simple model, with AUC scores of approx. 0.62. The top AUC scores are shown by the equal weighted ensemble and the optimised weighted ensemble. The ROC curves are all approximately the same, with the logistic regression and optimised weighted ensemble having a slight peak at approx. 55% true positive rate to 35% false positive rate.
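
To make the AUC figures concrete, here is a small sketch of how an ROC curve and its area can be computed by hand. The functions `roc_points` and `trapezoid_auc` are hypothetical helpers written for illustration (the notebook itself uses `roc_curve` and `auc`), and the labels and scores are made up.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists by sweeping a threshold down the sorted scores."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for rank, i in enumerate(order):
        if y_true[i] == 1:
            tp += 1
        else:
            fp += 1
        # only emit a point once all tied scores have been counted
        is_last = rank == len(order) - 1
        if is_last or scores[order[rank + 1]] != scores[i]:
            fpr.append(fp / n_neg)
            tpr.append(tp / n_pos)
    return fpr, tpr


def trapezoid_auc(fpr, tpr):
    """Area under the ROC curve using the trapezoidal rule."""
    area = 0.0
    for k in range(1, len(fpr)):
        area += (fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2.0
    return area


# made-up labels (1 = win) and predicted win probabilities
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.3, 0.1]
fpr, tpr = roc_points(y_true, scores)
auc_value = trapezoid_auc(fpr, tpr)
```

An AUC of 0.5 corresponds to the dashed random line, so scores of roughly 0.62 to 0.64 mean a randomly chosen win is ranked above a randomly chosen non-win a little under two thirds of the time.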

The cumulative response curve shows that the logistic regression, equal weighted ensemble and optimised weighted ensemble produce roughly the same true positive rate for the percentage of data seen. The equal weighted ensemble does have a slight increase at around 60 percent of the data, albeit a very minimal one.
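
For reference, a cumulative response curve can be sketched as follows, under the assumption that the x axis is the fraction of matches targeted in descending order of predicted probability. The plotting cells in this notebook use the position within the ROC threshold array as the x axis instead, so the two can differ slightly. The function `cumulative_response` is a hypothetical helper and the data is made up.

```python
def cumulative_response(y_true, scores):
    """Fraction of all positives captured vs fraction of instances targeted."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(y_true)
    pct_data, pct_pos = [0.0], [0.0]
    captured = 0
    for rank, i in enumerate(order, start=1):
        captured += y_true[i]
        pct_data.append(rank / len(order))   # share of matches targeted so far
        pct_pos.append(captured / n_pos)     # share of all wins found so far
    return pct_data, pct_pos


# made-up labels (1 = win) and predicted win probabilities
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.3, 0.1]
pct_data, pct_pos = cumulative_response(y_true, scores)
```

In this toy example, targeting the top half of the matches by score captures two of the three wins, which is the kind of reading the curves above are intended to support.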

After analysis the optimised weighted ensemble would be the best choice for the final home win model as it outperforms the other models, even though only marginally. The optimised weighted ensemble was also tested with the simple model included, which gave a slight boost in performance with an fbeta score of 0.597327 and a Cohen kappa score of 0.185075. Compared to the other models it also has the most true positive predictions along with the fewest false negatives, at the cost of a slight increase in false positive predictions, which lends itself to being more risk accepting. So as a final model the optimised weighted ensemble with the simple model included appears to be the best choice.

Away Models:

Both models again generalised well, even producing a slightly better fbeta score than expected. The XGBoost model performed the best with an fbeta score of 0.677068 and a Cohen kappa score of 0.171716, compared to the light gradient boosting model's fbeta score of 0.675380 and Cohen kappa score of 0.164242. The light gradient boosting model found that 'AHTP5PG', 'AHTGC5PG' and 'AHTGS5PG' provided the most information, and the XGBoost model found that 'cluster_1' and 'HA_ATP5PG_diff' provided the most information on the target. Both models outperformed the simple model.

The voting classifier increased performance with an fbeta score of 0.679937 and a Cohen kappa score of 0.174959.

The equal weighted ensemble model performed the same as the voting classifier. The optimised weighted ensemble model increased performance with an fbeta score of 0.681059 and a Cohen kappa score of 0.178157.

The learning curve for the XGBoost model shows the test score reaching a plateau and then declining slightly, meaning that more data might not improve the fbeta score. The learning curve for the light gradient boosting model shows the test score starting to plateau, which could also mean that more data might not improve the fbeta score. The reason behind the plateau and decline could be that the proportion of away wins is fairly low at approx. 29% and there are not many good predictors of an away win, so adding data has little effect on away win prediction performance. Also, because an away win is harder to predict, we will run into more false positive and false negative predictions. This could be because the home team or away team over performs or under performs, because of a dip in player confidence, or because of other underlying factors. To boost away win prediction performance and improve the learning curves, more attention to feature engineering in this area will help.

All models tested show ROC curves better than random and better than the simple model, with the curves all approx. the same. The optimised weighted ensemble has the best AUC score at 0.6439, just above the equal weighted ensemble.

All cumulative response curves are roughly the same up until approx. 40% of data seen, after which the simple model's true positive rate exceeds that of the other models.

After analysis the final away win model is the optimised weighted ensemble. Because of the simple model's higher true positive rate in the cumulative response curve, the simple model was included in the optimised ensemble, which led to increased performance with an fbeta score of 0.682849 and a Cohen kappa score of 0.183988.

--------------

After calculating the combined home and away model expected profits, we surprisingly found that the combination of the home optimised model and the away simple logistic regression model produced the highest expected profit of 3.93, meaning that in the long term we can expect an average return of 3.93 from each bet the model predicts we should place. This compares to an expected profit of 3.74 from the two top performing models, the home and away optimised models. The confusion matrices show why the away simple logistic regression model helps produce the best expected profit: in this domain it is extremely risk accepting, with nearly double the number of false positives compared to true positives and the smallest number of false negatives amongst the away models. This shows that allowing more risk can reap more rewards. However, the expected profit was calculated using an averaged return value from a £5 bet rather than the odds for the teams, so the result could change if the original odds were used because the profit margins would change, but it does provide a comparison. As the away simple model produces a huge number of false positives compared to the other away models, using it by itself might prove too untrustworthy, as its Cohen kappa score is half that of the best away ensemble model.
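
The effect described above can be illustrated with a small sketch of an expected profit calculation. The function `expected_profit_sketch` is a hypothetical helper that follows a common expected-value decomposition by class prior; the notebook's own `expected_profit` is defined elsewhere and may differ. The confusion matrices are made-up counts, and both matrices are assumed to use the `[[TN, FP], [FN, TP]]` layout.

```python
import numpy as np


def expected_profit_sketch(cm, cb_matrix, pos_prior):
    """Expected value per match scored (hypothetical helper).

    cm and cb_matrix are assumed to share the layout [[TN, FP], [FN, TP]].
    """
    cm = np.asarray(cm, dtype=float)
    cb = np.asarray(cb_matrix, dtype=float)
    # turn counts into rates conditional on the actual class (rows sum to 1)
    rates = cm / cm.sum(axis=1, keepdims=True)
    # expected value within each actual class
    per_class_value = (rates * cb).sum(axis=1)
    # weight the two classes by their prior probabilities
    priors = np.array([1.0 - pos_prior, pos_prior])
    return float(priors @ per_class_value)


# a false positive loses the 5 staked, a true positive returns 16
cb_matrix = [[0, -5], [0, 16]]
# made-up confusion matrices for a cautious and a risk accepting model
cautious_cm = [[80, 20], [60, 40]]
risky_cm = [[40, 60], [20, 80]]
cautious = expected_profit_sketch(cautious_cm, cb_matrix, pos_prior=0.29)
risky = expected_profit_sketch(risky_cm, cb_matrix, pos_prior=0.29)
```

With these illustrative numbers the risk accepting model has the higher expected value even though it makes three times as many false positives, because each extra true positive is worth more than three lost stakes.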

For future analysis of the away models, more focus should be applied to maximising recall, as this will allow the models to make more correct win predictions (minimising false negatives). The away model confusion matrices show that the away false negatives are more than double the false positives, whereas for the home models they are approximately equal. Future analysis of the home models should also pay more attention to weighting fbeta towards recall to maximise rewards.
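
To show what weighting fbeta towards recall does, the sketch below computes the score from raw counts for three values of beta. The function `fbeta_from_counts` is a hypothetical helper, the counts are made up, and the beta values are illustrative rather than the value used by the notebook's `fbeta` function.

```python
def fbeta_from_counts(tp, fp, fn, beta):
    """F-beta score from confusion matrix counts (hypothetical helper)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    # beta > 1 weights recall more heavily, beta < 1 weights precision more
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# made-up counts where false negatives clearly outnumber false positives
tp, fp, fn = 40, 20, 60
f_half = fbeta_from_counts(tp, fp, fn, beta=0.5)
f_one = fbeta_from_counts(tp, fp, fn, beta=1.0)
f_two = fbeta_from_counts(tp, fp, fn, beta=2.0)
```

When false negatives dominate, recall is lower than precision, so the score drops as beta increases. Optimising a recall-weighted fbeta therefore pushes a model to reduce missed wins.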

Both the home and away top ensemble models produce Cohen kappa scores of approx. 0.18, which are not trustworthy scores for predictive modelling, and this reflects the large amount of variability in the results of football matches.
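
As a rough illustration of why a kappa near 0.18 signals weak agreement, the sketch below computes Cohen's kappa directly from a confusion matrix. The function `cohen_kappa_from_cm` is a hypothetical helper (the notebook uses `cohen_kappa_score`) and the matrix holds made-up counts.

```python
def cohen_kappa_from_cm(cm):
    """Cohen's kappa from a square confusion matrix given as nested lists."""
    total = sum(sum(row) for row in cm)
    # observed agreement: share of predictions on the diagonal
    observed = sum(cm[k][k] for k in range(len(cm))) / total
    # agreement expected by chance from the row and column totals
    expected = 0.0
    for k in range(len(cm)):
        row_total = sum(cm[k])
        col_total = sum(row[k] for row in cm)
        expected += (row_total / total) * (col_total / total)
    return (observed - expected) / (1.0 - expected)


# made-up counts in the layout [[TN, FP], [FN, TP]]
cm = [[45, 15], [25, 15]]
accuracy = (cm[0][0] + cm[1][1]) / 100
kappa = cohen_kappa_from_cm(cm)
```

Here 60% of predictions are correct, yet chance alone would give 54% agreement, so kappa is only about 0.13. Kappa measures how far a model rises above chance agreement rather than its raw accuracy.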

# -- Models for cluster predictions --

In [None]:
# features for cluster 1 & 3 predictions
train_features = load_train_features()
train_features.head()
cluster_train_feats = []
# get columns that were used for kmeans clustering
for f in train_features.columns:
    if f != 'cluster_0':
        cluster_train_feats.append(f)
    else:
        break
cluster_train_feats = train_features[cluster_train_feats].copy()
drop_cols = ['Div', 'FTHG', 'FTAG', 'HS', 'AS', 'HST', 'AST', 'HF', 'AF', 'HC',
             'AC', 'HY', 'AY', 'HR', 'AR']
# drop columns that cannot be used as features
cluster_train_feats.drop(columns=drop_cols, inplace=True)
# cluster_train_feats = football_data_team_ohe(cluster_train_feats) # ohe
# targets for cluster predictions
cluster1_targ = train_features['cluster_1'].copy()
cluster3_targ = train_features['cluster_3'].copy()

In [None]:
# test features for clusters
test_features = load_test_features()
cluster_train_features = load_cluster_features()
cluster_train_features.drop(columns=['HomeTeam', 'AwayTeam'], inplace=True)
tf = cluster_train_features.columns
cluster_test_features = test_features[tf]

In [None]:
# split train/test
# load data
cluster_features = load_cluster_features()
cluster1_target = load_cluster1_target()
cluster3_target = load_cluster3_target()
# split data
cl1_train_features, cl1_test_features, cl1_train_target, cl1_test_target = train_test_split(
    cluster_features, cluster1_target, test_size=0.2, random_state=42, stratify=cluster1_target)
cl3_train_features, cl3_test_features, cl3_train_target, cl3_test_target = train_test_split(
    cluster_features, cluster3_target, test_size=0.2, random_state=42, stratify=cluster3_target)
# save data
# cluster 1
cl1_train_features.to_csv('cluster1_train_features.csv',
                          encoding='utf-8', index=False)
cl1_test_features.to_csv('cluster1_test_features.csv',
                         encoding='utf-8', index=False)
cl1_train_target.to_csv('cluster1_train_target.csv',
                        encoding='utf-8', index=False)
cl1_test_target.to_csv('cluster1_test_target.csv',
                       encoding='utf-8', index=False)
# cluster 3
cl3_train_features.to_csv('cluster3_train_features.csv',
                          encoding='utf-8', index=False)
cl3_test_features.to_csv('cluster3_test_features.csv',
                         encoding='utf-8', index=False)
cl3_train_target.to_csv('cluster3_train_target.csv',
                        encoding='utf-8', index=False)
cl3_test_target.to_csv('cluster3_test_target.csv',
                       encoding='utf-8', index=False)

## Cluster 1 Modelling

In [None]:
# cluster model test without home/away teams
cl1_train_feats, _ = load_cluster1_features()
cl1_train_target, _ = load_cluster1_targets()
cl1_train_feats.drop(columns=['HomeTeam', 'AwayTeam'], inplace=True)

sc = StandardScaler(with_mean=False)

cl_models, cl_names = load_cluster_models()

model_list = []
score_list = []
std_list = []

for i in range(len(cl_models)):
    print(f'Running {cl_names[i]} Model')
    pipe = Pipeline(
        steps=[
            ('sc', sc),
            ('model', cl_models[i])
        ]
    )

    scores = evaluate_cluster_model(cl1_train_feats, np.array(
        cl1_train_target).reshape(-1,), pipe, k_splits=10)
    model_list.append(cl_names[i])
    score_list.append(np.mean(scores))
    std_list.append(np.std(scores))
model_performance = pd.DataFrame(
    {'Model': model_list, 'Avg F1 Score': score_list, 'Std': std_list})
model_performance.sort_values(
    by='Avg F1 Score', ascending=False).reset_index(drop=True)

In [None]:
# cluster model test with home/away teams
cl1_train_feats, _ = load_cluster1_features()
cl1_train_target, _ = load_cluster1_targets()
ohtt = OHETeamTransformer()
ohe_cl1_train_feats = ohtt.transform(cl1_train_feats)

sc = StandardScaler(with_mean=False)

cl_models, cl_names = load_cluster_models()

model_list = []
score_list = []
std_list = []

for i in range(len(cl_models)):
    print(f'Running {cl_names[i]} Model')
    pipe = Pipeline(
        steps=[
            ('sc', sc),
            ('model', cl_models[i])
        ]
    )

    scores = evaluate_cluster_model(ohe_cl1_train_feats, np.array(
        cl1_train_target).reshape(-1,), pipe, k_splits=10)
    model_list.append(cl_names[i])
    score_list.append(np.mean(scores))
    std_list.append(np.std(scores))
model_performance = pd.DataFrame(
    {'Model': model_list, 'Avg F1 Score': score_list, 'Std': std_list})
model_performance.sort_values(
    by='Avg F1 Score', ascending=False).reset_index(drop=True)

In [None]:
# generalisation
# load data
cl1_train_features, cl1_test_features = load_cluster1_features()
cl1_train_target, cl1_test_target = load_cluster1_targets()
# remove home/away teams
cl1_train_features.drop(columns=['HomeTeam', 'AwayTeam'], inplace=True)
cl1_test_features.drop(columns=['HomeTeam', 'AwayTeam'], inplace=True)
# fit & predict
rf = RandomForestClassifier(n_estimators=200, class_weight='balanced',
                            max_features='sqrt', n_jobs=-1, random_state=42)
rf.fit(cl1_train_features, np.array(cl1_train_target).reshape(-1,))
rf_pred = rf.predict(cl1_test_features)
print(confusion_matrix(cl1_test_target, rf_pred))

In [None]:
# save cluster 1 random forest model
joblib.dump(rf, 'cl1_rf_model.sav')
# load model
# model = joblib.load('cl1_rf_model.sav')

## Cluster 3 Modelling

In [None]:
# cluster model test without home/away teams
cl3_train_feats, _ = load_cluster3_features()
cl3_train_target, _ = load_cluster3_targets()
cl3_train_feats.drop(columns=['HomeTeam', 'AwayTeam'], inplace=True)

sc = StandardScaler(with_mean=False)

cl_models, cl_names = load_cluster_models()

model_list = []
score_list = []
std_list = []

for i in range(len(cl_models)):
    print(f'Running {cl_names[i]} Model')
    pipe = Pipeline(
        steps=[
            ('sc', sc),
            ('model', cl_models[i])
        ]
    )

    scores = evaluate_cluster_model(cl3_train_feats, np.array(
        cl3_train_target).reshape(-1,), pipe, k_splits=10)
    model_list.append(cl_names[i])
    score_list.append(np.mean(scores))
    std_list.append(np.std(scores))
model_performance = pd.DataFrame(
    {'Model': model_list, 'Avg F1 Score': score_list, 'Std': std_list})
model_performance.sort_values(
    by='Avg F1 Score', ascending=False).reset_index(drop=True)

In [None]:
# cluster model test with home/away teams
cl3_train_feats, _ = load_cluster3_features()
cl3_train_target, _ = load_cluster3_targets()
ohtt = OHETeamTransformer()
ohe_cl3_train_feats = ohtt.transform(cl3_train_feats)

sc = StandardScaler(with_mean=False)

cl_models, cl_names = load_cluster_models()

model_list = []
score_list = []
std_list = []

for i in range(len(cl_models)):
    print(f'Running {cl_names[i]} Model')
    pipe = Pipeline(
        steps=[
            ('sc', sc),
            ('model', cl_models[i])
        ]
    )

    scores = evaluate_cluster_model(ohe_cl3_train_feats, np.array(
        cl3_train_target).reshape(-1,), pipe, k_splits=10)
    model_list.append(cl_names[i])
    score_list.append(np.mean(scores))
    std_list.append(np.std(scores))
model_performance = pd.DataFrame(
    {'Model': model_list, 'Avg F1 Score': score_list, 'Std': std_list})
model_performance.sort_values(
    by='Avg F1 Score', ascending=False).reset_index(drop=True)

In [None]:
# generalisation
# load data
cl3_train_features, cl3_test_features = load_cluster3_features()
cl3_train_target, cl3_test_target = load_cluster3_targets()
# remove home/away teams
cl3_train_features.drop(columns=['HomeTeam', 'AwayTeam'], inplace=True)
cl3_test_features.drop(columns=['HomeTeam', 'AwayTeam'], inplace=True)
# fit & predict
gb = GradientBoostingClassifier(
    n_estimators=200, learning_rate=0.1, max_features='sqrt', random_state=42)
gb.fit(cl3_train_features, np.array(cl3_train_target).reshape(-1,))
gb_pred = gb.predict(cl3_test_features)
print(confusion_matrix(cl3_test_target, gb_pred))

In [None]:
# save cluster 3 gradient boosting model
joblib.dump(gb, 'cl3_gb_model.sav')
# load model
# model = joblib.load('cl3_gb_model.sav')