# <a id='toc1_'></a>[COMP0036 Group Assignment: BEAT THE BOOKIE](#toc0_)

## <a id='toc1_1_'></a>[Group N - Introduction](#toc0_)

We have been assigned to build one or more models that predict the full-time result (FTR) of a match, which can be a Home Win (H), a Draw (D) or an Away Win (A). We begin by finding a suitable dataset and engineering features from it. This entails writing functions or classes that transform the raw data into a format where every match carries features computed from historic matches. We then perform feature selection to filter out unimportant features, train several models on the selected features, and compare them to decide which performs best. Finally, we tune the models to get the best accuracy.

**Table of contents**<a id='toc0_'></a>    
- [COMP0036 Group Assignment: BEAT THE BOOKIE](#toc1_)    
  - [Group N - Introduction](#toc1_1_)    
  - [Index](#toc1_2_)    
- [Data Import](#toc2_)    
  - [Libraries & data source define](#toc2_1_)    
    - [Libraries](#toc2_1_1_)    
    - [RUNNING Config (GPU/CPU)](#toc2_1_2_)    
    - [FLAGS TO DISABLE false positive warnings](#toc2_1_3_)    
    - [Data Source path](#toc2_1_4_)    
  - [Raw data loading and inspection](#toc2_2_)    
- [Data Transformation & Exploration](#toc3_)    
  - [Initial transformations](#toc3_1_)    
    - [Replacing 'Date' strings with DateTime objects](#toc3_1_1_)    
    - [Adding standings and rankings data to the Dataframe](#toc3_1_2_)    
    - [Adding manager data to the Dataframe](#toc3_1_3_)    
    - [Encoding Categorical Data](#toc3_1_4_)    
  - [Data Exploration](#toc3_2_)    
- [Feature Engineering](#toc4_)    
    - [Adding Average Past Match Statistics & Past Season % Number Of Wins](#toc4_1_1_)    
    - [Adding Expected Goals](#toc4_1_2_)    
    - [Removing Pre-encoded Data](#toc4_1_3_)    
  - [Breakdown of Features In The Dataframe/Dataset](#toc4_2_)    
  - [Final Dataframe containing all features](#toc4_3_)    
    - [PLOT COOR MATRIX AGAIN (DEBUG)](#toc4_3_1_)    
- [Auxiliary Functions + Classifier interfaces](#toc5_)    
  - [Evaluation helpers](#toc5_1_)    
  - [Plotting helpers](#toc5_2_)    
  - [Metrics and classifiers](#toc5_3_)    
    - [scoring metrics and define cross validation data split](#toc5_3_1_)    
      - [helper function for producing report](#toc5_3_1_1_)    
    - [Classifiers](#toc5_3_2_)    
      - [Random Guesses](#toc5_3_2_1_)    
      - [Decision Tree Classifier](#toc5_3_2_2_)    
      - [Random Forest Classifier](#toc5_3_2_3_)    
      - [K-Nearest Neighbours (KNN) Classifier](#toc5_3_2_4_)    
      - [Support Vector Machine (SVM) Classifier](#toc5_3_2_5_)    
      - [XGB](#toc5_3_2_6_)    
      - [Neural Network](#toc5_3_2_7_)    
        - [Build function of the NN](#toc5_3_2_7_1_)    
      - [Feature set interfaces](#toc5_3_2_8_)    
        - [without cross-validation](#toc5_3_2_8_1_)    
        - [with cross-validation](#toc5_3_2_8_2_)    
- [Model Selection via Cross Validation](#toc6_)    
    - [Introduction](#toc6_1_1_)    
    - [FEATURE SET 5 (MANUAL) – WITH Model Selection](#toc6_1_2_)    
      - [Create Design Matrix](#toc6_1_2_1_)    
  - [CONTINUE ON BEST FEATURE SET & EXPLORE MORE ON NN](#toc6_2_)    
      - [redefine summary info](#toc6_2_1_1_)    
    - [Aux Functions](#toc6_2_2_)    
      - [Build function](#toc6_2_2_1_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

## <a id='toc1_2_'></a>[Index](#toc0_)

# <a id='toc2_'></a>[Data Import](#toc0_)
Here we import the libraries and define the data sources.

## <a id='toc2_1_'></a>[Libraries & data source define](#toc0_)

### <a id='toc2_1_1_'></a>[Libraries](#toc0_)

In [None]:
import os
import pandas as pd
import numpy as np

import operator
import random
from calendar import month_name
# import seaborn as sns
from pandas.core.common import random_state
import tensorflow as tf

from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import balanced_accuracy_score
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_classif
import seaborn as sns

from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn import model_selection


from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import mean_squared_error

from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn import svm
from sklearn.svm import SVC
from sklearn.feature_selection import RFECV

from sklearn.metrics import balanced_accuracy_score
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel
from sklearn.feature_selection import SequentialFeatureSelector

from xgboost import XGBClassifier

from sklearn.model_selection import train_test_split
from sklearn.metrics import plot_confusion_matrix
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

from numpy import mean
from numpy import std
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import TimeSeriesSplit

from scikeras.wrappers import KerasClassifier

from datetime import datetime
import math

from matplotlib import pyplot as plt

### <a id='toc2_1_2_'></a>[RUNNING Config (GPU/CPU)](#toc0_)

In [None]:
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

In [None]:
# Comment out this cell if not using GPU acceleration
gpu_devices = tf.config.experimental.list_physical_devices('GPU')
print(gpu_devices)
for device in gpu_devices:
    tf.config.experimental.set_memory_growth(device, True)

# This statement is used to log if tf is using GPU
#tf.debugging.set_log_device_placement(True)

In [None]:
# Check CUDA
!conda list cudatoolkit

!conda list cudnn

### <a id='toc2_1_3_'></a>[FLAGS TO DISABLE false positive warnings](#toc0_)

In [None]:
pd.options.mode.chained_assignment = None  # default='warn'

### <a id='toc2_1_4_'></a>[Data Source path](#toc0_)

In [None]:
dirName_matchData = 'https://raw.githubusercontent.com/shabir-dhillon/GCW_0036/main/Group%20Coursework%20Brief-20221106/Data_Files/epl-full-training.csv'
dirName_rankingData = 'https://raw.githubusercontent.com/shabir-dhillon/GCW_0036/main/Group%20Coursework%20Brief-20221106/Data_Files/EPL%20Standings%202000-2022.csv'
dirName_managerData = 'https://raw.githubusercontent.com/shabir-dhillon/GCW_0036/main/Group%20Coursework%20Brief-20221106/Data_Files/epl-manager-data-V3.csv'

## <a id='toc2_2_'></a>[Raw data loading and inspection](#toc0_)

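The match dataset holds one row per English Premier League match, including the date, the two teams, the referee, the full-time and half-time results, and in-game statistics such as shots, corners, fouls and cards.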

In [None]:
# Read the main csv file
df_epl = pd.read_csv(dirName_matchData)
# Check the raw data
df_epl

The standings dataset holds the league position (`Pos`) of each team (`Team`) for each season (`Season`).

In [None]:
# Read the standings csv file
df_ranking = pd.read_csv(dirName_rankingData)
df_ranking

The manager dataset lists the managers (`Name`) of each club (`Club`) together with the start and end dates of their tenure.

In [None]:
# Read the manager csv file
df_manager = pd.read_csv(dirName_managerData, encoding="ISO-8859-1")
df_manager

# <a id='toc3_'></a>[Data Transformation & Exploration](#toc0_)

## <a id='toc3_1_'></a>[Initial transformations](#toc0_)

### <a id='toc3_1_1_'></a>[Replacing 'Date' strings with DateTime objects](#toc0_)

In the raw dataset, the `'Date'` column has a `string` type. Converting it to DateTime objects makes it easier to work with and lets us extract our first features: the day, month and year of the game. We will analyse their importance later.

In [None]:
df_epl["Date"] = pd.to_datetime(df_epl["Date"], dayfirst=True)

df_epl["Day"] = df_epl["Date"].dt.day
df_epl["Month"] = df_epl["Date"].dt.month 
df_epl["Year"] = df_epl["Date"].dt.year

### <a id='toc3_1_2_'></a>[Adding standings and rankings data to the Dataframe](#toc0_)

The following snippet combines the match dataframe `df_epl` with each team's league standing from the previous season.

In [None]:
# PART 1 - HELPER FUNCTIONS:

def get_season_start_date(date):
    if int(date.month) <= 7:
        return datetime(int(date.year)-1, 8, 1)
    return datetime(int(date.year), 8, 1)


# PART 2 - ADDING STANDINGS DATA TO df_epl:

homeTeamRankings = []
awayTeamRankings = []
diffRankings = []

for index, row in df_epl.iterrows():
    season = get_season_start_date(row["Date"])
    prev_season = str(season.year-1) + "-" + str(season.year)[-2:] # from df_ranking
    homeTeam = row["HomeTeam"]    
    awayTeam = row["AwayTeam"]

    # Filter the standings for the previous season and the home team
    df_epl_train_filtered_H = df_ranking.copy()
    df_epl_train_filtered_H = df_epl_train_filtered_H[(df_ranking.Season==prev_season) & (df_ranking.Team==homeTeam)]

    # Filter the standings for the previous season and the away team
    df_epl_train_filtered_A = df_ranking.copy()
    df_epl_train_filtered_A = df_epl_train_filtered_A[(df_ranking.Season==prev_season) & (df_ranking.Team==awayTeam)]
    
    # A team with no standing for the previous season was promoted from the lower league this year,
    # so we assign it the placeholder position 18.
    # TODO: refactor
    # 'Pos' is held in a pd.Series, so we use .values to access the value.
    # Each (season, team) pair has at most one position, so .values has length 1 or 0 (no data).
    if df_epl_train_filtered_H.empty and df_epl_train_filtered_A.empty:
        homeRanking = 18
        awayRanking = 18
    elif df_epl_train_filtered_H.empty and (not df_epl_train_filtered_A.empty):
        homeRanking = 18
        awayRanking = df_epl_train_filtered_A['Pos'].values[0]
    elif (not df_epl_train_filtered_H.empty) and df_epl_train_filtered_A.empty:
        homeRanking = df_epl_train_filtered_H['Pos'].values[0]
        awayRanking = 18
    else:
        homeRanking = df_epl_train_filtered_H['Pos'].values[0]
        awayRanking = df_epl_train_filtered_A['Pos'].values[0]

    homeTeamRankings.append(homeRanking)
    awayTeamRankings.append(awayRanking)
    diffRankings.append(homeRanking-awayRanking)

df_epl['HomeStanding_PrevSeason'] = homeTeamRankings
df_epl['AwayStanding_PrevSeason'] = awayTeamRankings
df_epl['DiffStanding_PrevSeason'] = diffRankings
df_epl
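
The row-by-row lookup above can also be written as a join. The following is a minimal sketch on toy data, not the notebook's implementation: the helper names `previous_season_label` and `add_prev_standing`, the constant `PROMOTED_DEFAULT` and all values are illustrative assumptions, and it assumes the standings hold at most one row per (`Season`, `Team`) pair.

```python
import pandas as pd

# Toy stand-ins for df_epl and df_ranking (values are illustrative, not real data)
matches = pd.DataFrame({
    "Date": pd.to_datetime(["2021-08-14", "2022-01-15"]),
    "HomeTeam": ["Arsenal", "Brentford"],
    "AwayTeam": ["Chelsea", "Arsenal"],
})
standings = pd.DataFrame({
    "Season": ["2020-21", "2020-21"],
    "Team": ["Arsenal", "Chelsea"],
    "Pos": [8, 4],
})

PROMOTED_DEFAULT = 18  # same placeholder the loop above uses for promoted teams

def previous_season_label(date):
    """Label of the season before the one `date` falls in, e.g. '2020-21'."""
    start_year = date.year - 1 if date.month <= 7 else date.year
    return f"{start_year - 1}-{str(start_year)[-2:]}"

def add_prev_standing(df, standings, team_col, out_col):
    """Left-join last season's position for the team named in `team_col`."""
    lookup = standings.rename(
        columns={"Season": "PrevSeason", "Team": team_col, "Pos": out_col}
    )
    merged = df.merge(lookup[["PrevSeason", team_col, out_col]],
                      on=["PrevSeason", team_col], how="left")
    # Teams without a standing last season get the placeholder position
    merged[out_col] = merged[out_col].fillna(PROMOTED_DEFAULT).astype(int)
    return merged

matches["PrevSeason"] = matches["Date"].apply(previous_season_label)
matches = add_prev_standing(matches, standings, "HomeTeam", "HomeStanding_PrevSeason")
matches = add_prev_standing(matches, standings, "AwayTeam", "AwayStanding_PrevSeason")
matches["DiffStanding_PrevSeason"] = (
    matches["HomeStanding_PrevSeason"] - matches["AwayStanding_PrevSeason"]
)
```

A left join keeps every match row and leaves a missing value where no standing exists, which is then filled with the same placeholder of 18.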

### <a id='toc3_1_3_'></a>[Adding manager data to the Dataframe](#toc0_)

The following snippet combines the existing dataframe `df_epl` with the third dataset, which lists the team managers.

In [None]:
# PART 1 - CLEAN MANAGER DATA:

# NaN values in 'End Date' indicate that the manager is still at the club,
# so we replace them with today's date
updatedEndDates = []
now = datetime.now() # Get current time
today = now.strftime("%d/%m/%Y")
for index, row in df_manager.iterrows():
    date =  row['End Date']
    if pd.isnull(date):
        updatedEndDates.append(today)
    else:
        updatedEndDates.append(date)
        
# Add these updated End_Dates to the dataframe
df_manager['End_Date'] = updatedEndDates

# Convert the string dates into datetime objects, stored in columns named with '_' instead of a space
df_manager['Start_Date'] = pd.to_datetime(df_manager["Start Date"], dayfirst=True)
df_manager['End_Date'] = pd.to_datetime(df_manager["End_Date"], dayfirst=True)



# PART 2 - ADDING MANAGER DATA TO df_epl:

homeTeamManagers = []
awayTeamManagers = []

# Here for each row in the df_epl we will add the manager for home and away team
for index, row in df_epl.iterrows():
    date =  row['Date']
    homeTeam = row["HomeTeam"]
    awayTeam = row["AwayTeam"]

    # Filter the manager data for tenures that cover the match date at the home team's club
    df_epl_train_filtered_H = df_manager.copy()
    df_epl_train_filtered_H = df_epl_train_filtered_H[(df_manager.Start_Date<=date) & (df_manager.End_Date>=date) & (df_manager.Club==homeTeam)]

    # Filter the manager data for tenures that cover the match date at the away team's club
    df_epl_train_filtered_A = df_manager.copy()
    df_epl_train_filtered_A = df_epl_train_filtered_A[(df_manager.Start_Date<=date) & (df_manager.End_Date>=date) & (df_manager.Club==awayTeam)]
    
    # NOTE - some managers come out as "Anonymous": this happens when a match takes place during a change of manager (the old manager has left and the new one has not started yet)
    if df_epl_train_filtered_H.empty:
        homeManager = "Anonymous"
    else:
        homeManager = df_epl_train_filtered_H['Name'].values[0]
        
    if df_epl_train_filtered_A.empty:
        awayManager = "Anonymous"
    else:
        awayManager = df_epl_train_filtered_A['Name'].values[0]
    
    homeTeamManagers.append(homeManager)
    awayTeamManagers.append(awayManager)

df_epl['Home_Manager'] = homeTeamManagers
df_epl['Away_Manager'] = awayTeamManagers
df_epl
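
The date-range lookup and the "Anonymous" fallback can be checked in isolation on a small example. This is a sketch under assumptions: the manager names and dates are made up, and the helper `manager_on` is hypothetical; only the column names (`Name`, `Club`, `Start Date`, `End Date`) come from the manager dataset used above.

```python
import pandas as pd

# Made-up manager tenures with a gap between the two appointments
managers = pd.DataFrame({
    "Name": ["A. Manager", "B. Manager"],
    "Club": ["Arsenal", "Arsenal"],
    "Start Date": ["01/07/2018", "20/12/2019"],
    "End Date": ["29/11/2019", None],  # None: still at the club
})
managers["Start_Date"] = pd.to_datetime(managers["Start Date"], dayfirst=True)
# An open-ended tenure is closed with today's date, as in the cell above
managers["End_Date"] = pd.to_datetime(
    managers["End Date"], dayfirst=True
).fillna(pd.Timestamp.today().normalize())

def manager_on(club, date):
    """Return the manager in charge of `club` on `date`, or 'Anonymous' in a gap."""
    hit = managers[(managers["Club"] == club)
                   & (managers["Start_Date"] <= date)
                   & (managers["End_Date"] >= date)]
    return hit["Name"].iloc[0] if not hit.empty else "Anonymous"

during_first = manager_on("Arsenal", pd.Timestamp("2019-10-01"))
in_gap = manager_on("Arsenal", pd.Timestamp("2019-12-10"))
during_second = manager_on("Arsenal", pd.Timestamp("2021-01-01"))
```

A match played on a date that falls between two tenures has no matching row, which is exactly the case that produces "Anonymous".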

### <a id='toc3_1_4_'></a>[Encoding Categorical Data](#toc0_)

Most models accept only numbers as their input, so we need to encode all categorical data like team names, referees and managers.

In [None]:
# Check if home teams are the same as away teams, same for managers
print(set(df_epl['AwayTeam'].unique()) == set(df_epl['HomeTeam'].unique()))
print(set(df_epl['Home_Manager'].unique()) == set(df_epl['Away_Manager'].unique()))

In [None]:
# Map categorical string data to integer values
team_encoder = LabelEncoder()
df_epl["HomeTeam_Enc"] = team_encoder.fit_transform(df_epl['HomeTeam'])
df_epl["AwayTeam_Enc"] = team_encoder.transform(df_epl['AwayTeam'])

referee_encoder = LabelEncoder()
df_epl["Referee_Enc"] = referee_encoder.fit_transform(df_epl['Referee'])

FTR_encoder = LabelEncoder()
df_epl["FTR_Enc"] = FTR_encoder.fit_transform(df_epl["FTR"])
df_epl["HTR_Enc"] = FTR_encoder.transform(df_epl["HTR"])

manager_encoder = LabelEncoder()
manager_encoder.fit(list(set(df_epl['Home_Manager'].unique()).union(set(df_epl['Away_Manager'].unique()))))
df_epl["Home_Manager_Enc"] = manager_encoder.transform(df_epl['Home_Manager'])
df_epl["Away_Manager_Enc"] = manager_encoder.transform(df_epl['Away_Manager'])
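
The reason for checking that both columns hold the same values, and for fitting the manager encoder on their union, is that the same name must map to the same integer in both columns. A pandas-only sketch of that idea on toy data (the frame `toy` and the names `teams`, `team_dtype` and `naive_away` are illustrative assumptions):

```python
import pandas as pd

toy = pd.DataFrame({
    "HomeTeam": ["Arsenal", "Chelsea"],
    "AwayTeam": ["Chelsea", "Leeds"],
})

# One sorted vocabulary built from both columns
teams = sorted(set(toy["HomeTeam"]) | set(toy["AwayTeam"]))
team_dtype = pd.CategoricalDtype(categories=teams)

# Shared vocabulary: a team gets the same code wherever it appears
toy["HomeTeam_Enc"] = toy["HomeTeam"].astype(team_dtype).cat.codes
toy["AwayTeam_Enc"] = toy["AwayTeam"].astype(team_dtype).cat.codes

# Encoding the away column on its own, for comparison
naive_away = toy["AwayTeam"].astype("category").cat.codes
```

With the shared vocabulary "Chelsea" is encoded as 1 in both columns, whereas encoding the away column on its own assigns it 0, so the two columns would no longer be comparable.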

## <a id='toc3_2_'></a>[Data Exploration](#toc0_)

Before extracting and engineering features, we explore the dataset to learn about its specifics. Firstly, we check for visible outliers, existing NaN values and have a look at the output of the `.describe()` summary for all numerical features. 

In [None]:
df_epl.describe().T

In [None]:
df_epl.isnull().values.any()

It looks like there are no NaN values in the dataset, and a quick glance at the other statistics suggests that the data is not corrupted. Let's check whether this classification problem can be considered balanced by plotting a histogram of the class labels.

In [None]:
sns.histplot(df_epl, x='FTR')
plt.show()
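
The class balance can also be read off numerically. A minimal sketch on made-up labels (the series `toy_ftr` is illustrative, not the real data): the share of the most frequent class equals the accuracy of a classifier that always predicts that class, which is a useful baseline for the models later on.

```python
import pandas as pd

# Made-up match results, standing in for df_epl['FTR']
toy_ftr = pd.Series(["H", "H", "D", "A", "H", "A", "D", "H"], name="FTR")

# Relative frequency of each class
shares = toy_ftr.value_counts(normalize=True)

# Accuracy obtained by always predicting the most frequent class
majority_baseline = shares.max()
```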

In [None]:
describe_df = df_epl.describe()
# Graph inspired by https://armantee.github.io/predicting/

home_features = ['FTHG','HTHG','HS','HST','HC','HF','HY','HR']
away_features = ['FTAG','HTAG','AS','AST','AC','AF','AY','AR']
fig=plt.figure(figsize=(18, 8), dpi= 80, facecolor='w', edgecolor='k')

# width of the bars
barWidth = 0.2
bars1 = np.array(describe_df.iloc[1, :][home_features]).flatten()
bars2 = np.array(describe_df.iloc[1, :][away_features]).flatten()
yer1 = np.array(describe_df.iloc[2, :][home_features]).flatten()
yer2 = np.array(describe_df.iloc[2, :][away_features]).flatten()
# The x position of bars
r1 = np.arange(len(bars1.flatten()))
r2 = [x + barWidth for x in r1]
# Create the orange bars (home team)
plt.bar(r1, bars1, width = barWidth, color = 'orange', edgecolor = 'black', yerr=yer1, capsize=7, label='Home team')
# Create the green bars (away team)
plt.bar(r2, bars2, width = barWidth, color = 'green', edgecolor = 'black', yerr=yer2, capsize=7, label='Away team')
# general layout
plt.xticks([r + barWidth for r in range(len(bars1))], ['Goals Scored', 'Half-time Goals Scored', 'Shots','Shots on target', 'Corners','Fouls','Yellow Cards','Red Cards'])
plt.ylabel('Average Value')
plt.ylim(0)
plt.title("Average values for features split between home and away team.")
plt.legend()
# Show graphic
plt.show()

Although most of the standard-deviation error bars overlap greatly, the home advantage is clearly visible. This needs to be taken into account when working on the classifiers. Shots and shots on target may be especially good indicators of the result. Let's verify this by exploring the correlation matrix for the existing features.

In [None]:
plt.figure(figsize=(20,10)) 
sns.heatmap(df_epl.corr(numeric_only=True), cmap="PiYG", annot=True)
plt.show()

In [None]:
year_dict = []
for i in range(2002, 2023):
    grouped_wins = df_epl[(df_epl['FTR'] == 'H') & (df_epl['Year'] == i)].groupby('HomeTeam')
    grouped_all = df_epl[(df_epl['Year'] == i)].groupby('HomeTeam')
    
    year_dict.append(dict(grouped_wins['FTR'].count()/grouped_all['FTR'].count()))

def get_team_stats(team_name):
    team = []
    for year in range(len(year_dict)):
        try:
            value = year_dict[year][team_name]
            if np.isnan(value):
                value = 0
        except KeyError:
            value = 0
        team.append(value)
    return team

In [None]:
fig, ax = plt.subplots(14, 3, figsize=(12, 40), constrained_layout=True)

for i, team_name in enumerate(df_epl['HomeTeam'].unique()):
    ax[i//3][i%3].set_title(team_name)
    ax[i//3][i%3].grid()
    # ax[i//3][i%3].set_xticks(np.arange(2002, 2023, step=1))
    ax[i//3][i%3].set_ylim(ymin=0)
    y = get_team_stats(team_name)
    x = np.array((range(2002, 2023)))
    ax[i//3][i%3].plot(x, y, '-o')
    if 0 not in set(y):
        a, b = np.polyfit(x, y, 1)
        ax[i//3][i%3].plot(x, a*x+b)
fig.suptitle("Team performance over years - fraction of home matches won")
plt.show()

It's worth noting that most of the teams did not play in the league for all of the past 20 years. This is likely caused by promotions and relegations.

Let's see whether the time of year has an influence on the match outcome.

In [None]:
counts = {}
month_names = {}
for month in range(1, 13):
    h = 100 * len(df_epl[(df_epl['Month'] == month) & (df_epl['FTR'] == 'H')])/len(df_epl[(df_epl['Month'] == month)])
    d = 100 * len(df_epl[(df_epl['Month'] == month) & (df_epl['FTR'] == 'D')])/len(df_epl[(df_epl['Month'] == month)])
    a = 100 * len(df_epl[(df_epl['Month'] == month) & (df_epl['FTR'] == 'A')])/len(df_epl[(df_epl['Month'] == month)])
    counts[month_name[month]] = [h, a, d]

In [None]:
# inspired by https://matplotlib.org/stable/gallery/lines_bars_and_markers/horizontal_barchart_distribution.html#sphx-glr-gallery-lines-bars-and-markers-horizontal-barchart-distribution-py

category_names = ['H', 'A', 'D']
results = counts

labels = list(results.keys())
data = np.array(list(results.values()))
data_cum = data.cumsum(axis=1)
category_colors = plt.colormaps['twilight'](
    np.linspace(0.15, 0.85, data.shape[1]))

fig, ax = plt.subplots(figsize=(10, 8))
ax.invert_yaxis()
ax.xaxis.set_visible(False)
ax.set_xlim(0, np.sum(data, axis=1).max())

for i, (colname, color) in enumerate(zip(category_names, category_colors)):
    widths = data[:, i]
    starts = data_cum[:, i] - widths
    rects = ax.barh(labels, widths, left=starts, height=0.5,
                    label=colname, color=color)

    r, g, b, _ = color
    text_color = 'white' if r * g * b < 0.5 else 'darkgrey'
    ax.bar_label(rects, label_type='center', fmt='%.1f', color=text_color)
    
ax.legend(ncol=len(category_names), bbox_to_anchor=(0, 0),
            loc='upper left', fontsize='small')
plt.title("Average match results (%) w.r.t. months across years 2000-2022")
plt.show()

In [None]:
sns.histplot(data=df_epl, x='Month', binwidth=0.3)

In general, results look similar across months, apart from June/July and March. This may be a consequence of far fewer matches being played in those months, which the histogram of matches per month above helps to check.

Now, let's see how the number of matches won by a team changes across years.

# <a id='toc4_'></a>[Feature Engineering](#toc0_)

In [None]:
# Transform the date column into day and month columns and Add into dataframe (Extract days & months from date)
df_epl["Day"] = df_epl["Date"].dt.day
df_epl["Month"] = df_epl["Date"].dt.month 
df_epl["Year"] = df_epl["Date"].dt.year

# Turn the categorical data into integer labels
df_epl["HomeTeam_Enc"] = df_epl["HomeTeam"].astype("category").cat.codes
df_epl["AwayTeam_Enc"] = df_epl["AwayTeam"].astype("category").cat.codes
df_epl["Referee_Enc"] = df_epl["Referee"].astype("category").cat.codes

FTR_encoder = LabelEncoder()
FTR_Enc = FTR_encoder.fit_transform(df_epl["FTR"])
df_epl["FTR_Enc"] = FTR_Enc
# Reuse the encoder fitted on FTR so that HTR shares the same label mapping
HTR_Enc = FTR_encoder.transform(df_epl["HTR"])
df_epl["HTR_Enc"] = HTR_Enc

df_epl["Home_Manager_Enc"] = df_epl["Home_Manager"].astype("category").cat.codes
df_epl["Away_Manager_Enc"] = df_epl["Away_Manager"].astype("category").cat.codes

### <a id='toc4_1_1_'></a>[Adding Average Past Match Statistics & Past Season % Number Of Wins](#toc0_)

`_HISTORY` -> Averages of past statistics over a team's whole history in the dataset. For each row of `df_epl`, we filter `df_epl` for the home team's earlier home matches and the away team's earlier away matches (all matches before the match date, against any opponent). We then average the statistic columns (HR, AR, etc.) of the filtered dataframe.

`_AVG` -> Averages of past statistics within the current season. For each row of `df_epl`, we filter `df_epl` for the home team's home matches against all other teams that took place earlier in the current season, and likewise for the away team's away matches. We then average the statistic columns (HR, AR, etc.) of the filtered dataframe, which gives the season-to-date averages for the home team's home games and the away team's away games.

`HW_AVG` & `AW_AVG` -> The fraction of those season-to-date matches that the team won: home wins divided by home games played for the home team, and away wins divided by away games played for the away team. Rows with no earlier match in the season get NaN and are removed by the final `dropna()`.
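
The key property of these features is that each row only sees matches played strictly before it, so the statistics of the match being predicted never leak into its own features. A compact way to express "average of all earlier rows per team" is a grouped, shifted expanding mean. This is a sketch on toy data under assumptions (the frame `toy` and its values are made up, and each team plays at most one match per date); it mirrors the all-time `_HISTORY` variant, and a season column would have to be added to the `groupby` keys for the `_AVG` variant.

```python
import pandas as pd

# Made-up home matches, sorted chronologically
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2021-08-14", "2021-08-28", "2021-09-11", "2021-08-21"]),
    "HomeTeam": ["Arsenal", "Arsenal", "Arsenal", "Chelsea"],
    "FTHG": [2, 0, 4, 1],
}).sort_values("Date").reset_index(drop=True)

# shift(1) drops the current match from its own average;
# expanding().mean() then averages every earlier home match of the same team
toy["FTHG_HISTORY"] = (
    toy.groupby("HomeTeam")["FTHG"]
       .transform(lambda s: s.shift(1).expanding().mean())
)
```

The first home match of each team has no history and therefore gets NaN, matching the rows that the cell below removes with `dropna()`.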

In [None]:
# PART 1 - THESE ARE HELPER FUNCTIONS WE NEED:

def get_season_start_date(date):
    # Seasons start on 1 August; matches played up to July belong to the season that started the previous year
    if date.month <= 7:
        return datetime(date.year-1, 8, 1)
    return datetime(date.year, 8, 1)

def filter_dataframe_by_hometeam_history(df, date, HomeTeam):
    # Make sure the input date is a datetime
    date = pd.to_datetime(date, dayfirst=True)

    # Keep only rows where Date < date and HomeTeam_Enc == HomeTeam
    df_filtered = df.copy()
    df_filtered = df_filtered[(df_filtered.Date<date) & (df_filtered.HomeTeam_Enc==HomeTeam)]

    # Return filtered dataframe
    return df_filtered

def filter_dataframe_by_awayteam_history(df, date, AwayTeam):
    # Make sure the input date is a datetime
    date = pd.to_datetime(date, dayfirst=True)

    # Keep only rows where Date < date and AwayTeam_Enc == AwayTeam
    df_filtered = df.copy()
    df_filtered = df_filtered[(df_filtered.Date<date) & (df_filtered.AwayTeam_Enc==AwayTeam)]

    # Return filtered dataframe
    return df_filtered

def filter_dataframe_by_hometeam_recent_season(df, date, HomeTeam):
    # Make sure the input date is a datetime
    date = pd.to_datetime(date, dayfirst=True)

    # Keep only rows where Date < date, Date > first day of the season and HomeTeam_Enc == HomeTeam
    df_filtered = df.copy()
    df_filtered = df_filtered[(df_filtered.Date<date) & (df_filtered.Date>get_season_start_date(date)) & (df_filtered.HomeTeam_Enc==HomeTeam)]

    # Return filtered dataframe
    return df_filtered

def filter_dataframe_by_awayteam_recent_season(df, date, AwayTeam):
    # Make sure the input date is a datetime
    date = pd.to_datetime(date, dayfirst=True)

    # Keep only rows where Date < date, Date > first day of the season and AwayTeam_Enc == AwayTeam
    df_filtered = df.copy()
    df_filtered = df_filtered[(df_filtered.Date<date) & (df_filtered.Date>get_season_start_date(date)) & (df_filtered.AwayTeam_Enc==AwayTeam)]

    # Return filtered dataframe
    return df_filtered

# Takes a dictionary mapping feature names to lists and a filtered dataframe;
# appends the mean of each feature over the filtered dataframe to the corresponding list
def average_columns(avg_features, filtered_df):
    for feature in avg_features.keys():
        df_col_means = filtered_df[feature].mean()
        avg_features[feature].append(df_col_means)
        
# Takes a list, a filtered dataframe and the result label to count ("H" or "A");
# appends the fraction of matches in the filtered dataframe with that result (NaN if there are none)
def find_number_of_wins(number_of_wins_list, filtered_df, team):
    df_filtered_ftr = filtered_df.copy()
    total_games = df_filtered_ftr.shape[0]
    if total_games == 0:
        number_of_wins_list.append(np.nan)
        return
    number_of_wins = df_filtered_ftr[(df_filtered_ftr.FTR==team)].shape[0]
    number_of_wins_list.append(number_of_wins/total_games)

        
# PART 2 - CREATE FEATURES & ADDING THEM TO DATAFRAME:

# These are the features we want to get averages for home team
avg_features_home_hist = {
                  "FTHG": [],
                  "HTHG": [],
                  "HS"  : [],
                  "HST" : [],
                  "HF"  : [],
                  "HC"  : [],
                  "HY"  : [],
                  "HR"  : []
              }

# These are the features we want to get averages for away team
avg_features_away_hist = {
                  "FTAG": [],
                  "HTAG": [],
                  "AS"  : [],
                  "AST" : [],
                  "AF"  : [],
                  "AC"  : [],
                  "AY"  : [],
                  "AR"  : []
                }

# These are the features we want to get averages for home team
avg_features_home_recent = {
                  "FTHG": [],
                  "HTHG": [],
                  "HS"  : [],
                  "HST" : [],
                  "HF"  : [],
                  "HC"  : [],
                  "HY"  : [],
                  "HR"  : []
              }

# These are the features we want to get averages for away team
avg_features_away_recent = {
                  "FTAG": [],
                  "HTAG": [],
                  "AS"  : [],
                  "AST" : [],
                  "AF"  : [],
                  "AC"  : [],
                  "AY"  : [],
                  "AR"  : []
                }

number_of_wins_HOME = []
number_of_wins_AWAY = []


# Run the helper functions on each row of df_epl and fill the dictionaries
for index, row in df_epl.iterrows():
    # Filter the dataframe for each team's matches before the match date (all-time and current season)
    df_epl_train_average_hometeam_history = filter_dataframe_by_hometeam_history(df_epl, row["Date"],row["HomeTeam_Enc"])
    df_epl_train_average_awayteam_history = filter_dataframe_by_awayteam_history(df_epl, row["Date"],row["AwayTeam_Enc"])
    df_epl_train_average_hometeam_recent_season = filter_dataframe_by_hometeam_recent_season(df_epl, row["Date"],row["HomeTeam_Enc"])
    df_epl_train_average_awayteam_recent_season = filter_dataframe_by_awayteam_recent_season(df_epl, row["Date"],row["AwayTeam_Enc"])
    # Get averages from the filtered dataframe and add to the dictionary
    average_columns(avg_features_home_hist, df_epl_train_average_hometeam_history)
    average_columns(avg_features_away_hist, df_epl_train_average_awayteam_history)
    average_columns(avg_features_home_recent, df_epl_train_average_hometeam_recent_season)
    average_columns(avg_features_away_recent, df_epl_train_average_awayteam_recent_season)
    # Get number_of_wins from the filtered dataframe and add to list
    find_number_of_wins(number_of_wins_HOME, df_epl_train_average_hometeam_recent_season, "H")
    find_number_of_wins(number_of_wins_AWAY, df_epl_train_average_awayteam_recent_season, "A")

    
# Add the all-time home-team averages to the dataframe
for feature in avg_features_home_hist.keys():
    # Get the list of averages for this feature from the dictionary
    feature_vals = avg_features_home_hist[feature]
    # Add the list of averages to the dataframe as a new column
    df_epl.loc[:,feature + "_HISTORY"] = feature_vals

# Add the all-time away-team averages to the dataframe
for feature in avg_features_away_hist.keys():
    # Get the list of averages for this feature from the dictionary
    feature_vals = avg_features_away_hist[feature]
    # Add the list of averages to the dataframe as a new column
    df_epl.loc[:,feature + "_HISTORY"] = feature_vals

# Add the current-season home-team averages to the dataframe
for feature in avg_features_home_recent.keys():
    # Get the list of averages for this feature from the dictionary
    feature_vals = avg_features_home_recent[feature]
    # Add the list of averages to the dataframe as a new column
    df_epl.loc[:,feature + "_AVG"] = feature_vals

# Add the current-season away-team averages to the dataframe
for feature in avg_features_away_recent.keys():
    # Get the list of averages for this feature from the dictionary
    feature_vals = avg_features_away_recent[feature]
    # Add the list of averages to the dataframe as a new column
    df_epl.loc[:,feature + "_AVG"] = feature_vals
    
# Add the past % number of wins
df_epl["HW_AVG"] = number_of_wins_HOME
df_epl["AW_AVG"] = number_of_wins_AWAY


# Drop any rows with nan
df_epl = df_epl.dropna()
df_epl

In [None]:
# PART 1 - THESE ARE HELPER FUNCTIONS WE NEED:

def get_season_start_date(date):
    # Seasons start on 1 August; matches played up to July belong to the season that started the previous year
    if date.month <= 7:
        return datetime(date.year-1, 8, 1)
    return datetime(date.year, 8, 1)

def filter_dataframe_by_hometeam_history(df, date, HomeTeam):
    # Make sure the input date is a datetime
    date = pd.to_datetime(date, dayfirst=True)

    # Keep only rows where Date < date and HomeTeam_Enc == HomeTeam
    df_filtered = df.copy()
    df_filtered = df_filtered[(df_filtered.Date<date) & (df_filtered.HomeTeam_Enc==HomeTeam)]

    # Return filtered dataframe
    return df_filtered

def filter_dataframe_by_awayteam_history(df, date, AwayTeam):
    # Make sure the input date is a datetime
    date = pd.to_datetime(date, dayfirst=True)

    # Keep only rows where Date < date and AwayTeam_Enc == AwayTeam
    df_filtered = df.copy()
    df_filtered = df_filtered[(df_filtered.Date<date) & (df_filtered.AwayTeam_Enc==AwayTeam)]

    # Return filtered dataframe
    return df_filtered

def filter_dataframe_by_hometeam_recent_season(df, date, HomeTeam):
    # Make sure the input date is a datetime
    date = pd.to_datetime(date, dayfirst=True)

    # Keep only rows where Date < date, Date > first day of the season and HomeTeam_Enc == HomeTeam
    df_filtered = df.copy()
    df_filtered = df_filtered[(df_filtered.Date<date) & (df_filtered.Date>get_season_start_date(date)) & (df_filtered.HomeTeam_Enc==HomeTeam)]

    # Return filtered dataframe
    return df_filtered

def filter_dataframe_by_awayteam_recent_season(df, date, AwayTeam):
    # Make sure the input date is a datetime
    date = pd.to_datetime(date, dayfirst=True)

    # Keep only rows where Date < date, Date > first day of the season and AwayTeam_Enc == AwayTeam
    df_filtered = df.copy()
    df_filtered = df_filtered[(df_filtered.Date<date) & (df_filtered.Date>get_season_start_date(date)) & (df_filtered.AwayTeam_Enc==AwayTeam)]

    # Return filtered dataframe
    return df_filtered

# Takes a dictionary mapping feature names to lists and a filtered dataframe;
# appends the mean of each feature over the filtered dataframe to the corresponding list
def average_columns(avg_features, filtered_df):
    for feature in avg_features.keys():
        df_col_means = filtered_df[feature].mean()
        avg_features[feature].append(df_col_means)
        
# Takes a list, a filtered dataframe and the result label to count ("H" or "A");
# appends the fraction of matches in the filtered dataframe with that result (NaN if there are none)
def find_number_of_wins(number_of_wins_list, filtered_df, team):
    df_filtered_ftr = filtered_df.copy()
    total_games = df_filtered_ftr.shape[0]
    if total_games == 0:
        number_of_wins_list.append(np.nan)
        return
    number_of_wins = df_filtered_ftr[(df_filtered_ftr.FTR==team)].shape[0]
    number_of_wins_list.append(number_of_wins/total_games)

        
# PART 2 - CREATE FEATURES & ADDING THEM TO DATAFRAME:

# These are the features we want to get averages for home team
avg_features_home_hist = {
                  "FTHG": [],
                  "HTHG": [],
                  "HS"  : [],
                  "HST" : [],
                  "HF"  : [],
                  "HC"  : [],
                  "HY"  : [],
                  "HR"  : []
              }

# These are the features we want to get averages for away team
avg_features_away_hist = {
                  "FTAG": [],
                  "HTAG": [],
                  "AS"  : [],
                  "AST" : [],
                  "AF"  : [],
                  "AC"  : [],
                  "AY"  : [],
                  "AR"  : []
                }

# These are the features we want to get current-season averages for home team
avg_features_home_recent = {
                  "FTHG": [],
                  "HTHG": [],
                  "HS"  : [],
                  "HST" : [],
                  "HF"  : [],
                  "HC"  : [],
                  "HY"  : [],
                  "HR"  : []
              }

# These are the features we want to get current-season averages for away team
avg_features_away_recent = {
                  "FTAG": [],
                  "HTAG": [],
                  "AS"  : [],
                  "AST" : [],
                  "AF"  : [],
                  "AC"  : [],
                  "AY"  : [],
                  "AR"  : []
                }

number_of_wins_HOME = []
number_of_wins_AWAY = []


# Run the two functions on each row of the df_epl and fill the dictionary
for index, row in df_epl.iterrows():
    # Filter the dataframe to keep only each team's matches played before this match (all-time history and current season)
    df_epl_train_average_hometeam_history = filter_dataframe_by_hometeam_history(df_epl, row["Date"],row["HomeTeam_Enc"])
    df_epl_train_average_awayteam_history = filter_dataframe_by_awayteam_history(df_epl, row["Date"],row["AwayTeam_Enc"])
    df_epl_train_average_hometeam_recent_season = filter_dataframe_by_hometeam_recent_season(df_epl, row["Date"],row["HomeTeam_Enc"])
    df_epl_train_average_awayteam_recent_season = filter_dataframe_by_awayteam_recent_season(df_epl, row["Date"],row["AwayTeam_Enc"])
    # Get averages from the filtered dataframe and add to the dictionary
    average_columns(avg_features_home_hist, df_epl_train_average_hometeam_history)
    average_columns(avg_features_away_hist, df_epl_train_average_awayteam_history)
    average_columns(avg_features_home_recent, df_epl_train_average_hometeam_recent_season)
    average_columns(avg_features_away_recent, df_epl_train_average_awayteam_recent_season)
    # Get number_of_wins from the filtered dataframe and add to list
    find_number_of_wins(number_of_wins_HOME, df_epl_train_average_hometeam_recent_season, "H")
    find_number_of_wins(number_of_wins_AWAY, df_epl_train_average_awayteam_recent_season, "A")

    
# Add features to dataframe
for feature in avg_features_home_hist.keys():
    # Get the list of averages for a certain feature from the dictionary
    feature_vals = avg_features_home_hist[feature]
    # Add the list of averages into the dataframe for that certain feature
    df_epl.loc[:,feature + "_HISTORY"] = feature_vals


# Add features to dataframe
for feature in avg_features_away_hist.keys():
    # Get the list of averages for a certain feature from the dictionary
    feature_vals = avg_features_away_hist[feature]
    # Add the list of averages into the dataframe for that certain feature
    df_epl.loc[:,feature + "_HISTORY"] = feature_vals


for feature in avg_features_home_recent.keys():
    # Get the list of averages for a certain feature from the dictionary
    feature_vals = avg_features_home_recent[feature]
    # Add the list of averages into the dataframe for that certain feature
    df_epl.loc[:,feature + "_AVG"] = feature_vals


for feature in avg_features_away_recent.keys():
    # Get the list of averages for a certain feature from the dictionary
    feature_vals = avg_features_away_recent[feature]
    # Add the list of averages into the dataframe for that certain feature
    df_epl.loc[:,feature + "_AVG"] = feature_vals
    
# Add the past % number of wins
df_epl["HW_AVG"] = number_of_wins_HOME
df_epl["AW_AVG"] = number_of_wins_AWAY


# Drop any rows with nan
df_epl = df_epl.dropna()
df_epl
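
The helper functions above only ever look at matches played strictly before the match being predicted, which keeps the averaged features free of information from the future. The same idea can be sketched with the standard library alone. The names `season_start` and `past_home_average`, the `TeamA` label and the toy records below are hypothetical and only illustrate the filtering logic; they are not the pandas implementation used in this notebook.

```python
from datetime import datetime

def season_start(date):
    """Return 1 August of the season that `date` falls in (season assumed to run Aug-Jul)."""
    year = date.year - 1 if date.month <= 7 else date.year
    return datetime(year, 8, 1)

def past_home_average(matches, date, home_team, stat, season_only=False):
    """Average `stat` over the team's earlier home matches, or None if there are none."""
    lower = season_start(date) if season_only else datetime.min
    values = [m[stat] for m in matches
              if m["HomeTeam"] == home_team and lower < m["Date"] < date]
    return sum(values) / len(values) if values else None

# Hypothetical toy records, not taken from the real dataset
matches = [
    {"Date": datetime(2019, 5, 4), "HomeTeam": "TeamA", "FTHG": 1},
    {"Date": datetime(2019, 8, 17), "HomeTeam": "TeamA", "FTHG": 2},
    {"Date": datetime(2019, 9, 1), "HomeTeam": "TeamA", "FTHG": 4},
    {"Date": datetime(2019, 9, 22), "HomeTeam": "TeamA", "FTHG": 3},
]
target = datetime(2019, 9, 22)

# All earlier home matches (1, 2 and 4 goals) versus current-season ones only (2 and 4 goals)
history_avg = past_home_average(matches, target, "TeamA", "FTHG")
recent_avg = past_home_average(matches, target, "TeamA", "FTHG", season_only=True)
print(history_avg, recent_avg)
```

The first match of a team has no earlier rows, so the helper returns `None`; in the notebook the corresponding rows end up as NaN and are removed by `dropna()`.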

### <a id='toc4_1_2_'></a>[Adding Expected Goals](#toc0_)

The expected goals for each team are calculated using a polynomial regression model that is trained on some of the average past statistics. Then, for each row of df_epl, we predict the expected goals for HomeTeam and AwayTeam using that model.
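
As a rough illustration of how the polynomial degree can be chosen, the sketch below fits polynomials of degree 1 to 4 on a chronological training split and keeps the degree with the lowest held-out mean squared error. It is a simplified stand-in, not the model used in this notebook: it uses a single synthetic feature and plain NumPy least squares instead of `PolynomialFeatures` and `LinearRegression`, and all names and data in it are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 3.0, 200)                       # stand-in for one averaged statistic
y = 0.5 + 0.8 * x + rng.normal(0.0, 0.3, x.size)     # synthetic goal counts

split = int(0.8 * x.size)                            # chronological 80/20 split, no shuffling
x_train, x_test = x[:split], x[split:]
y_train, y_test = y[:split], y[split:]

def held_out_mse(degree):
    # Columns 1, x, ..., x**degree; the same expansion is applied to both splits
    A_train = np.vander(x_train, degree + 1, increasing=True)
    A_test = np.vander(x_test, degree + 1, increasing=True)
    # Coefficients are estimated from the training rows only
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return float(np.mean((A_test @ coef - y_test) ** 2))

mse_by_degree = {d: held_out_mse(d) for d in range(1, 5)}
best_degree = min(mse_by_degree, key=mse_by_degree.get)
print(best_degree, mse_by_degree[best_degree])
```

Note that the degree is selected on the held-out rows, so the reported error of the chosen degree is an optimistic estimate.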

In [None]:
# PART 1 - CREATE EXPECTED GOALS REGRESSION MODEL:

# Here we aim to create an 'expected or predicted goals for a HomeTeam' feature based upon past average stats
min_mse_home = float('inf')
min_mse_away = float('inf')

# Create the design matrix
X_H = df_epl.loc[:,['Day', 'Month', 'HomeTeam_Enc', 'FTHG_AVG', 'HTHG_AVG', 'HS_AVG']].values
y_H = df_epl.loc[:,'FTHG'].values
X_H_train, X_H_test, y_H_train, y_H_test = model_selection.train_test_split(X_H, y_H, test_size=0.2, shuffle=False)

# Similar idea for AwayTeam
X_A = df_epl.loc[:,['Day', 'Month', 'AwayTeam_Enc', 'FTAG_AVG', 'HTAG_AVG', 'AS_AVG']].values
y_A = df_epl.loc[:,'FTAG'].values
X_A_train, X_A_test, y_A_train, y_A_test = model_selection.train_test_split(X_A, y_A, test_size=0.2, shuffle=False)

# Here we use polynomial regression - and select the best degree:
for i in range(1,5):
    # Select order
    poly = PolynomialFeatures(degree=i)

    # Transform the features
    X_H_train_transform = poly.fit_transform(X_H_train)
    X_H_test_transform = poly.transform(X_H_test)

    LR_Model_HOME_EG = LinearRegression()
    # Fit the model using training data
    LR_Model_HOME_EG.fit(X_H_train_transform, y_H_train)
    # Make predictions using the model we have created
    LR_H_predictions_test = LR_Model_HOME_EG.predict(X_H_test_transform)
#     # Check the mean square error(MSE) for HomeTeam Expected Goals
#     print(i, mean_squared_error(LR_H_predictions_test, y_H_test))

    # Transform the features
    X_A_train_transform = poly.fit_transform(X_A_train)
    X_A_test_transform = poly.transform(X_A_test)

    LR_Model_AWAY_EG = LinearRegression()
    # Fit the model using training data
    LR_Model_AWAY_EG.fit(X_A_train_transform, y_A_train)
    # Make predictions using the model we have created
    LR_A_predictions_test = LR_Model_AWAY_EG.predict(X_A_test_transform)
#     # Check the mean square error(MSE) for AwayTeam Expected Goals
#     print(i, mean_squared_error(LR_A_predictions_test, y_A_test))

    curr_mse_home = mean_squared_error(y_H_test, LR_H_predictions_test)
    curr_mse_away = mean_squared_error(y_A_test, LR_A_predictions_test)

    if curr_mse_home < min_mse_home:
        best_poly1 = poly
        best_model_home = LR_Model_HOME_EG
        min_mse_home = curr_mse_home

    if curr_mse_away < min_mse_away:
        best_poly2 = poly
        best_model_away = LR_Model_AWAY_EG
        min_mse_away = curr_mse_away
        
        
# PART 2 - ADD EXPECTED GOALS:

# Using the two regression models above, predict the number of goals that the Home and Away teams will score for each row in the dataframe:
HomeExGoals = []
AwayExGoals = []

# For each row, predict the home and away expected goals
for index, row in df_epl.iterrows():
    X_Home_features = np.array([[row["Day"],row["Month"],row["HomeTeam_Enc"],row["FTHG_AVG"],row["HTHG_AVG"],row["HS_AVG"]]])
    X_Away_features = np.array([[row["Day"],row["Month"],row["AwayTeam_Enc"],row["FTAG_AVG"],row["HTAG_AVG"],row["AS_AVG"]]])
    # Transform features since we use polynomial regression (the transformers were already fitted above)
    X_Home_features_transform = best_poly1.transform(X_Home_features)
    X_Away_features_transform = best_poly2.transform(X_Away_features)
    # Use the best polynomial regression model - Note the prediction is a 1 by 1 vector
    ex_home_goals = best_model_home.predict(X_Home_features_transform)[0]
    ex_away_goals = best_model_away.predict(X_Away_features_transform)[0]
    # Add prediction to list
    HomeExGoals.append(ex_home_goals)
    AwayExGoals.append(ex_away_goals)

# Add this data into the dataframe
df_epl["Ex_Goals_Home"] = HomeExGoals
df_epl["Ex_Goals_Away"] = AwayExGoals


# # Turn the categorical data into labels using same method from before
# df_epl["AwayTeam_Enc"] = df_epl["AwayTeam"].astype("category").cat.codes
# df_epl["HomeTeam_Enc"] = df_epl["HomeTeam"].astype("category").cat.codes
# df_epl = df_epl.drop(['HomeTeam', 'AwayTeam', 'Div'], axis=1)
# # Transform the date column into day and month columns and Add into dataframe (Extract days & months from date)
# df_epl["Date"] = pd.to_datetime(df_epl["Date"])
# df_epl["Day"] = df_epl["Date"].dt.day
# df_epl["Month"] = df_epl["Date"].dt.month 
# df_epl["Year"] = df_epl["Date"].dt.year

### <a id='toc4_1_3_'></a>[Removing Pre-encoded Data](#toc0_)

In [None]:
# df_epl.drop(['FTR','HTR','Referee','Home_Manager','Away_Manager'],inplace=True,axis=1)
df_epl.drop(['HTR','Referee','Home_Manager','Away_Manager'],inplace=True,axis=1)

## <a id='toc4_2_'></a>[Breakdown of Features In The Dataframe/Dataset](#toc0_)

## <a id='toc4_3_'></a>[Final Dataframe containing all features](#toc0_)

In [None]:
df_epl

### <a id='toc4_3_1_'></a>[PLOT CORR MATRIX AGAIN (DEBUG)](#toc0_)

In [None]:
plt.figure(figsize=(20,10)) 
sns.heatmap(df_epl.corr(numeric_only=True), cmap="PiYG", annot=False)
plt.show()

# <a id='toc5_'></a>[Auxiliary Functions + Classifier interfaces](#toc0_)

## <a id='toc5_1_'></a>[Evaluation helpers](#toc0_)

In [None]:
def evaluate_report(y_pred, y_test):
  y_pred = y_pred.ravel()
  y_test = y_test.ravel()

  print("Balanced Accuracy: ", balanced_accuracy_score(y_test,y_pred))
  print("Accuracy: ", accuracy_score(y_test, y_pred))
  # handle f1 score zero division
  # https://stackoverflow.com/questions/62326735/metrics-f1-warning-zero-division
  print(classification_report(y_test, y_pred, zero_division=0))

  ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
  plt.show()
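
Balanced accuracy is reported alongside plain accuracy because the three outcomes are not equally frequent. It is the mean of the per-class recalls, so a model that always predicts the most common class scores poorly on it. The sketch below recomputes both metrics by hand on a small hypothetical example; the function names and labels are assumptions made for illustration and are not part of the notebook's pipeline.

```python
from collections import defaultdict

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

# Hypothetical labels: a model that always predicts a home win
y_true = ["H", "H", "H", "H", "D", "A"]
y_pred = ["H"] * 6

print(accuracy(y_true, y_pred))           # 4 of 6 correct
print(balanced_accuracy(y_true, y_pred))  # recalls are H = 1, D = 0, A = 0
```

Here the always-home model reaches an accuracy of 4/6 but a balanced accuracy of only 1/3, which is the level expected from guessing among three classes.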

In [None]:
# Used for summary part of each feature Set
model_acc_dict = {
  'RG': 0,
  'DT': 0,
  'RF': 0,
  'KNN': 0,
  'SVM': 0,
  'XGB': 0,
  'NN': 0
}

compare_feature_sets_dict = {
  'Feature Set 1': {'RG': 0,'DT': 0,'RF': 0,'KNN': 0,'SVM': 0,'XGB': 0,'NN': 0},
  'Feature Set 2': {'RG': 0,'DT': 0,'RF': 0,'KNN': 0,'SVM': 0,'XGB': 0,'NN': 0},
  'Feature Set 3': {'RG': 0,'DT': 0,'RF': 0,'KNN': 0,'SVM': 0,'XGB': 0,'NN': 0},
  'Feature Set 4': {'RG': 0,'DT': 0,'RF': 0,'KNN': 0,'SVM': 0,'XGB': 0,'NN': 0},
  'Feature Set 5': {'RG': 0,'DT': 0,'RF': 0,'KNN': 0,'SVM': 0,'XGB': 0,'NN': 0},
}

## <a id='toc5_2_'></a>[Plotting helpers](#toc0_)

In [None]:
def summary_hist(model_acc_dict=model_acc_dict, title="Summary of Models - Feature Set 1", fig_size=(9, 9)):
  model_names = list(model_acc_dict.keys())
  '''
  model_names = ["Random Guess", "Decision Tree", 
                  "Random Forest", "K Nearest Neighbors",
                  "Support Vector Machine", "XGB",
                  "Neural Network"]
  '''
  
  x_label = "Balanced Accuracy (%)"
  y_label = "Models trained"

  accs = model_acc_dict.values()

  fig = plt.figure(figsize=fig_size)
  ax = fig.gca()
  p1 = ax.barh(model_names, accs)

  ax.set_title(title, fontsize=12)
  ax.set_xlabel(x_label)
  ax.set_ylabel(y_label)

  for i, v in enumerate(accs):
      ax.text(v//2, i, str(v), color='white', fontsize=9, ha='left', va='center')

  fig.tight_layout()
  plt.show()

In [None]:
'''
Function used to plot how the training and cross-validation scores change with a hyperparameter
'''
def plot_train_test_acc(results , scoring, param_x = "param_max_depth",title="GridSearchCV evaluation", xlabel="max_depth", ylabel="Score",xlim=(0,100), ylim=(0.4,1), fig_size=(9, 9)):

    # REF:https://scikit-learn.org/stable/auto_examples/model_selection/plot_multi_metric_evaluation.html#sphx-glr-auto-examples-model-selection-plot-multi-metric-evaluation-py
    plt.figure(figsize=fig_size)
    plt.title(title, fontsize=16)

    plt.xlabel(xlabel)
    plt.ylabel(ylabel)

    ax = plt.gca()
    ax.set_xlim(xlim)
    ax.set_ylim(ylim)

    # Get the regular numpy array from the MaskedArray
    X_axis = np.array(results[param_x].data, dtype=float)

    for scorer, color in zip(sorted(scoring), ["g", "k"]):
        for sample, style in (("train", "--"), ("test", "-")):
            sample_score_mean = results["mean_%s_%s" % (sample, scorer)]
            sample_score_std = results["std_%s_%s" % (sample, scorer)]
            ax.fill_between(
                X_axis,
                sample_score_mean - sample_score_std,
                sample_score_mean + sample_score_std,
                alpha=0.1 if sample == "test" else 0,
                color=color,
            )
            # change label(test) -> cross validation to avoid confusion
            if sample == "test":
              sample = "cross validation"
            ax.plot(
                X_axis,
                sample_score_mean,
                style,
                color=color,
                alpha=1 if sample == "cross validation" else 0.7,
                label="%s (%s)" % (scorer, sample),
            )

        best_index = np.nonzero(results["rank_test_%s" % scorer] == 1)[0][0]
        best_score = results["mean_test_%s" % scorer][best_index]

        # Plot a dotted vertical line at the best score for that scorer marked by x
        ax.plot(
            [
                X_axis[best_index],
            ]
            * 2,
            [0, best_score],
            linestyle="-.",
            color=color,
            marker="x",
            markeredgewidth=3,
            ms=8,
        )
    
        # Annotate the best score for that scorer
        ax.annotate("%0.4f" % best_score, (X_axis[best_index], best_score + 0.005))

    plt.legend(loc="best")
    plt.grid(False)
    plt.show()

## <a id='toc5_3_'></a>[Metrics and classifiers](#toc0_)

### <a id='toc5_3_1_'></a>[scoring metrics and define cross validation data split](#toc0_)

In [None]:
"""
NOTE: Scoring metrics
https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
"""
scoring = {"Accuracy": "accuracy", "Balanced_accuracy": "balanced_accuracy"}
refit = "Balanced_accuracy"
#NOTE: test on larger splits
cv = TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None)
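
To make the time-ordered splitting concrete, the sketch below generates expanding-window folds by hand: every fold trains on all rows before its test block, so no fold is validated on matches older than its training data. The function `expanding_window_splits` is a hypothetical helper written to mirror our understanding of how `TimeSeriesSplit` behaves with `gap=0` and no `max_train_size`; it should be checked against the scikit-learn documentation rather than taken as a reference implementation.

```python
def expanding_window_splits(n_samples, n_splits=5):
    """Yield (train_indices, test_indices) with every test block after its training block."""
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("need more samples than splits")
    first_test_start = n_samples - n_splits * test_size
    for start in range(first_test_start, n_samples, test_size):
        yield list(range(start)), list(range(start, start + test_size))

# 12 rows ordered by date, split into 5 folds
splits = list(expanding_window_splits(12, n_splits=5))
for train_idx, test_idx in splits:
    print(train_idx, test_idx)
```

With 12 rows the first fold trains on rows 0 to 1 and validates on rows 2 to 3, while the last fold trains on rows 0 to 9 and validates on rows 10 to 11.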

#### <a id='toc5_3_1_1_'></a>[helper function for producing report](#toc0_)

In [None]:
def clf_eval(clf, X_test, y_test, featureset_name, classifier_name):
    # Make predictions using the model we have created
    y_pred = clf.predict(X_test).ravel()
    # Reconverting prediction values (i.e. 0, 1 or 2) back into (H, D or A) using the FTR_encoder defined in earlier cell
    y_pred = FTR_encoder.inverse_transform(y_pred)
    y_test = y_test.ravel()

    evaluate_report(y_pred, y_test)
    model_acc_dict[classifier_name] = round(balanced_accuracy_score(y_test,y_pred)*100, 2)
    # NOTE: we are not logging values for random guesser here.
    compare_feature_sets_dict[featureset_name][classifier_name] = round(balanced_accuracy_score(y_test,y_pred)*100, 2)

### <a id='toc5_3_2_'></a>[Classifiers](#toc0_)

#### <a id='toc5_3_2_1_'></a>[Random Guesses](#toc0_)

In [None]:
class RandomGuessClassifier(BaseEstimator, ClassifierMixin):
    """Custom implementation of the random guess classifier. 
    
    Compatible with sklearn.metrics.ConfusionMatrixDisplay.
    """
    def __init__(self) -> None:
        self._labels: np.ndarray = None

    def fit(self, x_train: np.ndarray, y_train: np.ndarray) -> None:
        """Does not use x_train at all (who would actually want to use it lol). 
        Saves available labels from y_train.
        """
        self._labels = np.unique(y_train)

    def predict(self, x_test: np.ndarray) -> np.ndarray:
        """For every sample in x_test, chooses a label from self._labels at random"""
        np.random.seed(42)
        return np.array(list(map(lambda _: np.random.choice(self._labels, 1), x_test)))

In [None]:
def fit_rg_n_cv(X_train, y_train):
    rgc = RandomGuessClassifier()
    rgc.fit(X_train, y_train)
    return rgc

#### <a id='toc5_3_2_2_'></a>[Decision Tree Classifier](#toc0_)

In [None]:
def fit_dt_n_cv(X_train, y_train):
    # Create an empty Tree model
    dt = DecisionTreeClassifier(random_state=42)
    # Fit the model using training data
    dt.fit(X_train, y_train)
    return dt

In [None]:
"""Decision Tree"""
def fit_DT(X, y):
    """ Parameters can be tweaked for regularization (more options see sklearn documentation)
    {'ccp_alpha': 0.0,
    'class_weight': None,
    'criterion': 'gini',
    'max_depth': None,
    'max_features': None,
    'max_leaf_nodes': None,
    'min_impurity_decrease': 0.0,
    'min_samples_leaf': 1,
    'min_samples_split': 2,
    'min_weight_fraction_leaf': 0.0,
    'random_state': None,
    'splitter': 'best'}
    """
 
    classifier = DecisionTreeClassifier(random_state=42)
    print(list(np.linspace(1, X.shape[1], X.shape[1]//5, dtype=int)))
    param_grid = {'max_depth': list(np.linspace(1, X.shape[1], X.shape[1]//5, dtype=int)),
                  'max_features': list(np.linspace(1, X.shape[1], X.shape[1]//5, dtype=int))}
    # NOTE: more options (e.g. custom scoring metrics) are available -> see sklearn doc
    # n_jobs=-1 runs the search in parallel on all processors
    # verbose -> level of detail printed (0, 1, >1); higher -> more detailed
    grid = GridSearchCV(
        classifier, 
        param_grid=param_grid, 
        cv=cv, 
        verbose=1, 
        n_jobs=-1, 
        scoring=scoring, 
        return_train_score=True,
        refit=refit)
    clf = grid.fit(X, y)

    return clf
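
The grid for `max_depth` and `max_features` depends on the number of feature columns, so it helps to see which values the `np.linspace` expression actually produces. The sketch below evaluates it for 37 columns (an assumed count, matching the Feature Set 5 design matrix defined later) and for a very small feature set; `depth_grid` is a hypothetical helper used only for this illustration.

```python
import numpy as np

def depth_grid(n_features):
    # Same expression as in fit_DT and fit_RF, with X.shape[1] replaced by n_features
    return [int(v) for v in np.linspace(1, n_features, n_features // 5, dtype=int)]

grid_37 = depth_grid(37)
print(grid_37)                      # 37 // 5 = 7 evenly spaced values from 1 to 37
print(len(grid_37) ** 2 * 5)        # candidate pairs times the 5 TimeSeriesSplit folds
print(depth_grid(4))                # fewer than 5 features gives an empty grid
```

Since both hyperparameters share the same grid, the search cost grows quadratically with the number of grid values, and feature sets with fewer than five columns would need a different grid.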

#### <a id='toc5_3_2_3_'></a>[Random Forest Classifier](#toc0_)

In [None]:
def fit_rf_n_cv(X_train, y_train):
    # Create an empty Random Forest model
    rf = RandomForestClassifier(random_state=42)
    # Fit the model using training data
    rf.fit(X_train, y_train)
    return rf

In [None]:
def fit_RF(X, y):
    """ Parameters can be tweaked for regularization (more options see sklearn documentation)
    {'bootstrap': True,
    'ccp_alpha': 0.0,
    'class_weight': None,
    'criterion': 'gini',
    'max_depth': None,
    'max_features': 'sqrt',
    'max_leaf_nodes': None,
    'max_samples': None,
    'min_impurity_decrease': 0.0,
    'min_samples_leaf': 1,
    'min_samples_split': 2,
    'min_weight_fraction_leaf': 0.0,
    'n_estimators': 100,
    'n_jobs': None,
    'oob_score': False,
    'random_state': None,
    'verbose': 0,
    'warm_start': False}
    """
    
    classifier = RandomForestClassifier(random_state=42)
    param_grid = {'max_depth':list(np.linspace(1, X.shape[1], X.shape[1]//5, dtype=int)),
                  'max_features': list(np.linspace(1, X.shape[1], X.shape[1]//5, dtype=int))}
    # NOTE: more options (e.g. custom scoring metrics) are available -> see sklearn doc
    # n_jobs=-1 runs the search in parallel on all processors
    # verbose -> level of detail printed (0, 1, >1); higher -> more detailed
    grid = GridSearchCV(
        classifier, 
        param_grid=param_grid, 
        cv=cv, 
        verbose=1, 
        n_jobs=-1, 
        scoring=scoring, 
        return_train_score=True,
        refit=refit)
    clf = grid.fit(X, y)

    return clf

#### <a id='toc5_3_2_4_'></a>[K-Nearest Neighbours (KNN) Classifier](#toc0_)

In [None]:
def fit_knn_n_cv(X_train, y_train):
    # Create an empty KNN model
    knn = KNeighborsClassifier()
    # Fit the model using training data
    knn.fit(X_train, y_train)
    return knn

In [None]:
def fit_KNN(X, y):
    """ Parameters can be tweaked for regularization (more options see sklearn documentation)
    {'algorithm': 'auto',
    'leaf_size': 30,
    'metric': 'minkowski',
    'metric_params': None,
    'n_jobs': None,
    'n_neighbors': 5,
    'p': 2,
    'weights': 'uniform'}
    """
    classifier = KNeighborsClassifier()
    # Only the number of neighbours is tuned here
    param_grid = {'n_neighbors':list(np.linspace(start=3, stop=100, num=50, dtype=int))}
    # NOTE: more options (e.g. custom scoring metrics) are available -> see sklearn doc
    # n_jobs=-1 runs the search in parallel on all processors
    # verbose -> level of detail printed (0, 1, >1); higher -> more detailed
    grid = GridSearchCV(
        classifier, 
        param_grid=param_grid, 
        cv=cv, 
        verbose=1, 
        n_jobs=-1, 
        scoring=scoring, 
        return_train_score=True,
        refit=refit)
    clf = grid.fit(X, y)

    return clf

#### <a id='toc5_3_2_5_'></a>[Support Vector Machine (SVM) Classifier](#toc0_)

In [None]:
def fit_svm_n_cv(X_train, y_train):
    # Create an empty SVM classifier model with RBF kernel
    svc = svm.SVC(kernel='rbf')
    # Fit the model using training data
    svc.fit(X_train, y_train)
    return svc

In [None]:
def fit_SVM(X, y):
    """ Parameters can be tweaked for regularization (more options see sklearn documentation)
    {'C': 1.0,
    'break_ties': False,
    'cache_size': 200,
    'class_weight': None,
    'coef0': 0.0,
    'decision_function_shape': 'ovr',
    'degree': 3,
    'gamma': 'scale',
    'kernel': 'rbf',
    'max_iter': -1,
    'probability': False,
    'random_state': None,
    'shrinking': True,
    'tol': 0.001,
    'verbose': False}
    """
    classifier = svm.SVC()
    # Only the regularization parameter C is tuned here
    param_grid = {'C':[1.0,5.0,10.0,15.0, 20.0, 25.0, 30.0]
                  }
    # NOTE: more options (e.g. custom scoring metrics) are available -> see sklearn doc
    # n_jobs=-1 runs the search in parallel on all processors
    # verbose -> level of detail printed (0, 1, >1); higher -> more detailed
    grid = GridSearchCV(
        classifier, 
        param_grid=param_grid, 
        cv=cv, 
        verbose=1, 
        n_jobs=-1, 
        scoring=scoring, 
        return_train_score=True,
        refit=refit)
    clf = grid.fit(X, y)

    return clf

#### <a id='toc5_3_2_6_'></a>[XGB](#toc0_)

In [None]:
def fit_xgb_n_cv(X_train, y_train):
    # create a baseline XGBoost classifier
    # changed use_label_encoder=False here 

    ## NOTE: USE CPU
    #xgb = XGBClassifier(use_label_encoder=False, eval_metric="mlogloss")

    ## NOTE: USE GPU
    xgb = XGBClassifier(eval_metric="mlogloss", tree_method="gpu_hist", gpu_id=0)

    # Fit the model using training data
    xgb.fit(X_train, y_train)
    return xgb

In [None]:
def fit_XGB(X, y):

    ## NOTE: USE CPU
    #classifier = XGBClassifier(use_label_encoder=False, eval_metric="mlogloss")

    ## NOTE: USE GPU
    classifier = XGBClassifier(eval_metric="mlogloss", tree_method="gpu_hist", gpu_id=0)
    
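    # Search over booster, learning rate, number of trees, tree depth and tree method
    # (the tree_method values in the grid replace the gpu_hist setting above during the search)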
    param_grid = {
        "booster": ['gbtree', 'dart'],
        "learning_rate": [0.1, 0.3, 0.5],
        "n_estimators": [5, 10, 20, 50, 100],
        "max_depth": [5, 10, 20, 50, 100],
        "tree_method": ['exact', 'approx', 'hist'],
        "eval_metric": ['mlogloss']
    }
    # NOTE: more options (e.g. custom scoring metrics) are available -> see sklearn doc
    # n_jobs=-1 runs the search in parallel on all processors
    # verbose -> level of detail printed (0, 1, >1); higher -> more detailed
    grid = GridSearchCV(
        classifier, 
        param_grid=param_grid, 
        cv=cv, 
        verbose=1, 
        n_jobs=-1, 
        scoring=scoring, 
        return_train_score=True,
        refit=refit)
    clf = grid.fit(X, y)

    return clf

#### <a id='toc5_3_2_7_'></a>[Neural Network](#toc0_)

##### <a id='toc5_3_2_7_1_'></a>[Build function of the NN](#toc0_)

In [None]:
def build_clf(input_size):

    # Clean session so when running CV in parallel, it does not cause OOM
    tf.keras.backend.clear_session()

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(input_size, activation='relu', input_dim=input_size),
        tf.keras.layers.Dense(256, activation='relu'),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(3, activation='softmax')
    ])

    model.compile(optimizer='adam',
                loss=tf.keras.losses.SparseCategoricalCrossentropy(),
                metrics=['accuracy'],
                weighted_metrics=['accuracy']
                )
    return model

In [None]:
def fit_nn_n_cv(X_train, y_train):
    # wrap the Keras model in a scikit-learn compatible classifier
    model_warpped = KerasClassifier(model=build_clf, input_size=X_train.shape[1], epochs=20, verbose=1)
    #print(model_warpped.get_params()) 

    # balanced accuracy weights
    counts = np.unique(y_train, return_counts=True)
    class_weights = dict(zip(counts[0], np.reciprocal(counts[1].astype('float64'))))

    model_warpped.fit(X_train, y_train, class_weight=class_weights)

    return model_warpped
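
The class weights passed to the network are the reciprocals of the class counts, so that each outcome contributes the same total weight to the loss regardless of how often it occurs. A minimal standard-library sketch of that computation follows; the helper name `inverse_frequency_weights` and the encoded labels are hypothetical.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Map each class label to 1 / (number of samples with that label)."""
    counts = Counter(labels)
    return {label: 1.0 / count for label, count in sorted(counts.items())}

y_encoded = [2, 2, 2, 2, 0, 0, 1]   # hypothetical encoded match results
weights = inverse_frequency_weights(y_encoded)
print(weights)

# Each class sums to the same total weight across its samples
totals = {label: weights[label] * y_encoded.count(label) for label in weights}
print(totals)
```

Because the weights are not rescaled, their absolute size shrinks as the dataset grows, which effectively scales the loss; only the ratios between classes express the balancing.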

In [None]:
def fit_NN(X_train, y_train):
    '''
    {'model': <function build_clf at 0x7fad83129870>, 
    'build_fn': None, 'warm_start': False, 
    'random_state': None, 
    'optimizer': 'rmsprop', 
    'loss': None, 
    'metrics': None, 
    'batch_size': None, 
    'validation_batch_size': None, 
    'verbose': 1, 
    'callbacks': None, 
    'validation_split': 0.0, 
    'shuffle': True, 
    'run_eagerly': False, 
    'epochs': 10, 
    'input_size': 28, 
    'class_weight': None}
    '''

    # wrap the Keras model in a scikit-learn compatible classifier
    model_warpped = KerasClassifier(model=build_clf, input_size=X_train.shape[1], epochs=10)
    print(model_warpped.get_params()) 

    # balanced accuracy weights
    counts = np.unique(y_train, return_counts=True)
    class_weights = dict(zip(counts[0], np.reciprocal(counts[1].astype('float64'))))

    param_grid = {
        'epochs': [1, 5, 10, 20, 30],
        'optimizer': ['rmsprop', 'adam', 'adagrad'],
        'batch_size': [8, 16, 64],
        'class_weight': [class_weights]
    }

    # NOTE: more options (e.g. custom scoring metrics) are available -> see sklearn doc
    # n_jobs=-1 runs the search in parallel on all processors
    # verbose -> level of detail printed (0, 1, >1); higher -> more detailed
    grid = GridSearchCV(
        model_warpped, 
        param_grid=param_grid, 
        cv=cv, 
        verbose=1, 
        n_jobs=-1, 
        scoring=scoring, 
        return_train_score=True,
        refit=refit)
    clf = grid.fit(X_train, y_train)

    return clf


#### <a id='toc5_3_2_8_'></a>[Feature set interfaces](#toc0_)

##### <a id='toc5_3_2_8_1_'></a>[without cross-validation](#toc0_)

In [None]:
def fit_df_without_cv(X_train, y_train, X_test, y_test, featureset_name="Feature Set 2"):

    # # fit decision Tree Classifier
    # clf = fit_dt_n_cv(X_train, y_train)
    # clf_eval(clf, X_test, y_test, featureset_name=featureset_name, classifier_name="DT")
    
    # # fit random forest
    # clf = fit_rf_n_cv(X_train, y_train)
    # clf_eval(clf, X_test, y_test, featureset_name=featureset_name, classifier_name="RF")

    # # fit knn
    # clf = fit_knn_n_cv(X_train, y_train)
    # clf_eval(clf, X_test, y_test, featureset_name=featureset_name, classifier_name="KNN")

    # # fit SVM
    # clf = fit_svm_n_cv(X_train, y_train)
    # clf_eval(clf, X_test, y_test, featureset_name=featureset_name, classifier_name="SVM")   

    # # fit XGB
    # clf = fit_xgb_n_cv(X_train, y_train)
    # clf_eval(clf, X_test, y_test, featureset_name=featureset_name, classifier_name="XGB")  

    # fit NN
    clf = fit_nn_n_cv(X_train, y_train)
    clf_eval(clf, X_test, y_test, featureset_name=featureset_name, classifier_name="NN") 


##### <a id='toc5_3_2_8_2_'></a>[with cross-validation](#toc0_)

In [None]:
def fit_df_cv(X_train, y_train, X_test, y_test, featureset_name="Feature Set 2"):
    
    # fit decision Tree Classifier
    clf = fit_DT(X_train, y_train)
    results = clf.cv_results_
    # plot cross-validation score vs training score
    plot_train_test_acc(results, scoring, title="Evaluation using Cross Validation (Decision Tree)", xlabel="max_depth", ylabel="Score", fig_size=(8,8), xlim=(0,20), ylim=(0.2,1.0))
    best_clf = clf.best_estimator_
    clf_eval(best_clf, X_test, y_test, featureset_name=featureset_name, classifier_name="DT")

    # fit Random Forest Classifier
    clf = fit_RF(X_train, y_train)
    results = clf.cv_results_
    # plot cross-validation score vs training score
    plot_train_test_acc(results, scoring, title="Evaluation using Cross Validation (Random Forest)", xlabel="max_depth", ylabel="Score", fig_size=(8,8), xlim=(0,20), ylim=(0.2,1.0))
    best_clf = clf.best_estimator_
    clf_eval(best_clf, X_test, y_test, featureset_name=featureset_name, classifier_name="RF")

    # fit KNN
    clf = fit_KNN(X_train, y_train)
    results = clf.cv_results_
    # plot cross-validation score vs training score
    plot_train_test_acc(results, scoring,param_x="param_n_neighbors", title="Evaluation using Cross Validation (KNN)", xlabel="number of neighbors", ylabel="Score", fig_size=(8,8), xlim=(3,100), ylim=(0.2,0.7))
    best_clf = clf.best_estimator_
    clf_eval(best_clf, X_test, y_test, featureset_name=featureset_name, classifier_name="KNN")

    # fit SVM
    clf = fit_SVM(X_train, y_train)
    results = clf.cv_results_
    # plot cross-validation score vs training score
    plot_train_test_acc(results, scoring,param_x="param_C", title="Evaluation using Cross Validation (SVM, rbf kernel)", xlabel="C value", ylabel="Score", fig_size=(8,8), xlim=(0,35), ylim=(0.2,0.7))
    best_clf = clf.best_estimator_
    clf_eval(best_clf, X_test, y_test, featureset_name=featureset_name, classifier_name="SVM")

    # XGB
    clf = fit_XGB(X_train, y_train)
    results = clf.cv_results_
    # plot cross-validation score vs training score (not plotted for XGB)
    #plot_train_test_acc()
    best_clf = clf.best_estimator_
    clf_eval(best_clf, X_test, y_test, featureset_name=featureset_name, classifier_name="XGB")

    # NN
    clf = fit_NN(X_train, y_train)
    results = clf.cv_results_
    # plot cross-validation score vs training score
    plot_train_test_acc(results, scoring,param_x="param_epochs", title="Evaluation using Cross Validation (NN)", xlabel="Number of epoch", ylabel="Score", fig_size=(8,8), xlim=(0,35), ylim=(0.2,0.7))
    best_clf = clf.best_estimator_
    clf_eval(best_clf, X_test, y_test, featureset_name=featureset_name, classifier_name="NN")





# <a id='toc6_'></a>[Model Selection via Cross Validation](#toc0_)

### <a id='toc6_1_1_'></a>[Introduction](#toc0_)

Here we aim to optimize the final few feature sets using model selection and cross-validation.

### <a id='toc6_1_2_'></a>[FEATURE SET 5 (MANUAL) – WITH Model Selection](#toc0_)

#### <a id='toc6_1_2_1_'></a>[Create Design Matrix](#toc0_)

In [None]:
def create_df(df_epl):
  return df_epl.copy()

def create_design_matrix(df):
  X = df.loc[:,['Day', 'Month', 'HomeTeam_Enc', 'AwayTeam_Enc',
                              'HS_HISTORY','AS_HISTORY','HST_HISTORY','AST_HISTORY','HF_HISTORY','AF_HISTORY','HC_HISTORY','AC_HISTORY','HY_HISTORY','AY_HISTORY','HR_HISTORY','AR_HISTORY',
                              'HS_AVG','AS_AVG','HST_AVG','AST_AVG','HF_AVG','AF_AVG','HC_AVG','AC_AVG','HY_AVG','AY_AVG','HR_AVG','AR_AVG','HW_AVG','AW_AVG','Ex_Goals_Home','Ex_Goals_Away',
                              'HomeStanding_PrevSeason','AwayStanding_PrevSeason','DiffStanding_PrevSeason','Home_Manager_Enc','Away_Manager_Enc']].values
  # X = df.drop(['FTR', 'Date'], axis=1).values
  return X

df_final = create_df(df_epl)
X = create_design_matrix(df_final)
y = df_final.loc[:,['FTR']].values.ravel()
# NOTE: shuffle=True, so matches are split at random rather than in chronological order
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2, shuffle=True)

FTR_encoder = LabelEncoder()
y_train = FTR_encoder.fit_transform(y_train)
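
`LabelEncoder` assigns integer codes to the classes in sorted order. As a rough pure-Python sketch of that mapping, assuming `FTR` still holds the raw strings `'H'`, `'D'` and `'A'` at this point (the helper names `fit_label_mapping` and `encode` are hypothetical and not part of the notebook):

```python
def fit_label_mapping(labels):
    """Map each distinct label to an integer code, in sorted label order."""
    classes = sorted(set(labels))
    return {label: code for code, label in enumerate(classes)}


def encode(labels, mapping):
    """Replace every label by its integer code."""
    return [mapping[label] for label in labels]


# Toy training labels (assumed raw FTR values, not real data)
y_train_raw = ["H", "D", "A", "H", "A"]
mapping = fit_label_mapping(y_train_raw)
y_train_encoded = encode(y_train_raw, mapping)

print(mapping)
print(y_train_encoded)
```

Under that assumption, any labels later compared against the model's integer predictions would need to go through the same mapping (with scikit-learn, `FTR_encoder.transform`).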

## <a id='toc6_2_'></a>[CONTINUE ON BEST FEATURE SET & EXPLORE MORE ON NN](#toc0_)

Based on the results above, we select the NN as our final classifier and continue with Feature Set 5, since it achieves a significantly higher balanced accuracy.
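
For reference, balanced accuracy is the mean of the per-class recalls, so a model that always predicts the most frequent outcome is not rewarded for it. A minimal standard-library sketch on made-up labels (the function name and the toy data are illustrative assumptions, not results from this notebook):

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Toy example: six home wins, two draws, two away wins;
# the "model" always predicts a home win.
y_true = ["H"] * 6 + ["D"] * 2 + ["A"] * 2
y_pred = ["H"] * 10

plain_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bal_acc = balanced_accuracy(y_true, y_pred)
print(plain_acc, bal_acc)
```

In this toy case plain accuracy looks reasonable, while balanced accuracy drops to one third because two of the three classes are never predicted correctly.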

#### <a id='toc6_2_1_1_'></a>[Redefine summary info](#toc0_)

Redefine the auxiliary dictionaries used later by the evaluation and summary functions.

In [None]:
# Used for the summary part of each feature set
model_acc_dict = {
'build_clf_0': 0,
'build_clf_1': 0
}

# NOTE: not used in this section; defined only because the evaluation function expects it
compare_feature_sets_dict = {
'Feature Set 5': {'build_clf_0': 0, 'build_clf_1': 0},
}

### <a id='toc6_2_2_'></a>[Aux Functions](#toc0_)

#### <a id='toc6_2_2_1_'></a>[Build function](#toc0_)

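Here we propose NNs with different architectures. Note that `build_clf_2` currently duplicates `build_clf_0` and is not included in the search run in this section.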

In [None]:
def build_clf_0(input_size):

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(input_size, activation='relu', input_dim=input_size),
        tf.keras.layers.Dense(256, activation='relu'),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(3, activation='softmax')
    ])

    model.compile(optimizer='adam',
                loss=tf.keras.losses.SparseCategoricalCrossentropy(),
                metrics=['accuracy'],
                weighted_metrics=['accuracy']
                )
    return model

In [None]:
def build_clf_1(input_size):

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(input_size, activation='relu', input_dim=input_size),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(8, activation='relu'),
        tf.keras.layers.Dense(3, activation='softmax')
    ])

    model.compile(optimizer='adam',
                loss=tf.keras.losses.SparseCategoricalCrossentropy(),
                metrics=['accuracy'],
                weighted_metrics=['accuracy']
                )
    return model

In [None]:
def build_clf_2(input_size):

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(input_size, activation='relu', input_dim=input_size),
        tf.keras.layers.Dense(256, activation='relu'),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(3, activation='softmax')
    ])

    model.compile(optimizer='adam',
                loss=tf.keras.losses.SparseCategoricalCrossentropy(),
                metrics=['accuracy'],
                weighted_metrics=['accuracy']
                )
    return model
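
To get a feel for how different the architectures are in size, the number of trainable parameters can be counted by hand: a dense layer with `n_in` inputs and `n_out` units has `(n_in + 1) * n_out` parameters (weights plus biases). A small sketch, assuming the 37 columns selected in `create_design_matrix` as the input size; `dense_param_count` is a hypothetical helper, not part of the notebook:

```python
def dense_param_count(input_size, layer_units):
    """Total weights + biases of a stack of fully connected layers."""
    total = 0
    n_in = input_size
    for n_out in layer_units:
        total += (n_in + 1) * n_out  # +1 accounts for the bias of each unit
        n_in = n_out
    return total


input_size = 37  # assumption: number of columns picked in create_design_matrix

# Layer widths copied from build_clf_0 and build_clf_1
params_clf_0 = dense_param_count(input_size, [input_size, 256, 128, 3])
params_clf_1 = dense_param_count(input_size, [input_size, 64, 64, 8, 3])
print(params_clf_0, params_clf_1)
```

Under this assumption the first architecture has roughly five times as many parameters as the second, which is worth keeping in mind when comparing how quickly each one overfits.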

In [None]:
def fit_NN_tone(X_train, y_train, build_func):
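    '''
    Tune a Keras model created by build_func with a cross-validated grid search
    and return the fitted search object.

    Parameters of the wrapped classifier, kept here for reference: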
    {'model': <function build_clf at 0x7fad83129870>, 
    'build_fn': None, 'warm_start': False, 
    'random_state': None, 
    'optimizer': 'rmsprop', 
    'loss': None, 
    'metrics': None, 
    'batch_size': None, 
    'validation_batch_size': None, 
    'verbose': 1, 
    'callbacks': None, 
    'validation_split': 0.0, 
    'shuffle': True, 
    'run_eagerly': False, 
    'epochs': 10, 
    'input_size': 28, 
    'class_weight': None}
    '''

    # reset the seeds every time before training
    os.environ['PYTHONHASHSEED'] = '42'
    tf.keras.utils.set_random_seed(42)
    tf.config.experimental.enable_op_determinism()

    # wrap the Keras model in a scikit-learn compatible classifier
    model_wrapped = KerasClassifier(model=build_func, input_size=X_train.shape[1], epochs=10)
    # print(model_wrapped.get_params())

    # class weights inversely proportional to class frequency (to balance the classes)
    classes, counts = np.unique(y_train, return_counts=True)
    class_weights = dict(zip(classes, 1.0 / counts))

    param_grid = {
        'epochs': [1, 5, 10, 20, 30, 50],
        'optimizer': ['rmsprop', 'adam', 'adagrad'],
        'batch_size': [8, 16, 64],
        'class_weight': [class_weights]
    }

    # NOTE: more options (e.g. custom scoring metrics) are available -> see the sklearn docs
    # verbose -> level of detail (0, 1, >1); higher means more detailed output
    grid = GridSearchCV(
        model_wrapped,
        param_grid=param_grid,
        cv=cv,
        verbose=1,
        # n_jobs=-2 uses all cores but one, which keeps the system responsive
        n_jobs=-2,
        scoring=scoring,
        return_train_score=True,
        refit=refit)
    clf = grid.fit(X_train, y_train)


    return clf
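
The class-weight and grid-size logic in `fit_NN_tone` can be illustrated without Keras. The sketch below is a standard-library approximation: `inverse_frequency_weights` and `count_fits` are hypothetical names, the labels are toy data, and the 5 folds are an assumption, since `cv` is defined elsewhere in the notebook.

```python
from collections import Counter
from itertools import product


def inverse_frequency_weights(labels):
    """Weight each class by 1 / (number of samples in that class)."""
    counts = Counter(labels)
    return {label: 1.0 / n for label, n in counts.items()}


def count_fits(param_grid, n_folds):
    """Number of model fits in an exhaustive grid search (final refit excluded)."""
    n_candidates = len(list(product(*param_grid.values())))
    return n_candidates * n_folds


# Toy encoded labels: class 2 is the most frequent, class 1 the rarest.
y_toy = [2, 2, 2, 2, 0, 0, 1]
class_weights = inverse_frequency_weights(y_toy)

# Same grid as in fit_NN_tone
param_grid = {
    "epochs": [1, 5, 10, 20, 30, 50],
    "optimizer": ["rmsprop", "adam", "adagrad"],
    "batch_size": [8, 16, 64],
    "class_weight": [class_weights],
}

n_folds = 5  # assumption: the notebook's `cv` object is defined elsewhere
total_fits = count_fits(param_grid, n_folds)
print(class_weights)
print(total_fits)
```

Rarer classes receive larger weights, so their errors count more during training. The grid holds 6 * 3 * 3 = 54 candidate settings, so with five folds each architecture would be fitted 270 times, which is why the search is run in parallel.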


Fix the random seed for the NN so that results are reproducible.

In [None]:

def fit_NN_tone_CV(X_train, y_train, X_test, y_test, featureset_name="Feature Set 5"):

    my_NNs = [build_clf_0, build_clf_1]
    for cur_build_func in my_NNs:
        clf = fit_NN_tone(X_train, y_train, cur_build_func)
        results = clf.cv_results_
        # plot training score vs. validation score
        plot_train_test_acc(results, scoring,param_x="param_epochs", title="Evaluation using Cross Validation (NN)", xlabel="Number of epoch", ylabel="Score", fig_size=(8,8), xlim=(0,35), ylim=(0.2,0.7))
        best_clf = clf.best_estimator_
        # use the function name (e.g. 'build_clf_0') so it matches the keys of model_acc_dict
        clf_eval(best_clf, X_test, y_test, featureset_name=featureset_name, classifier_name=cur_build_func.__name__)




#### Evaluation

In [None]:
fit_NN_tone_CV(X_train, y_train, X_test, y_test)

#### Summary

In [None]:
summary_hist(model_acc_dict=model_acc_dict, title="Summary of NNs - Feature Set 5", fig_size=(9, 9))

##### RANDOM FOREST (DEBUG)

In [None]:
# fit Random Forest Classifier
clf = fit_RF(X_train, y_train)
results = clf.cv_results_
# plot training score vs. validation score
plot_train_test_acc(results, scoring, title="Evaluation using Cross Validation (Random Forest)", xlabel="max_depth", ylabel="Score", fig_size=(8,8), xlim=(0,20), ylim=(0.2,1.0))
best_clf = clf.best_estimator_
clf_eval(best_clf, X_test, y_test, featureset_name="Feature Set 5", classifier_name="RF")

##### WITHOUT CV

In [None]:
# fit random forest
clf = fit_rf_n_cv(X_train, y_train)
clf_eval(clf, X_test, y_test, featureset_name='Feature Set 5', classifier_name="RF")