In this notebook, I look for the best models and pre-processing techniques for predicting the number of home fouls per game.

Feature scaling methods:
* Standardizing
* Normalizing

Feature selection methods:
* Filter method
* Forward selection
* Backward elimination
* Feature importance

Models to be used:
* Decision tree
* Random Forest 
* XGBoost
* SVM
* Poisson

# Imports

In [1]:
import os
os.chdir("..")

In [2]:
import math
import pandas as pd
from sklearn.model_selection import train_test_split

from src.models.functions import (
    SVMRegression,
    PoissonRegression,
    DecisionTreeRegression,
    RandomForestRegression,
    XGBRegression,
)
from src.etl.clean import (
    clean_fouls_dataset,
    categorical_variables_to_numeric,
)
from src.etl.feature_scaling import standardize_data, normalize_data
from src.etl.feature_selection import get_feature_selection_variables_dict
from src.etl.evaluation_metrics import (
    calculate_mean_squared_error,
    calculate_root_mean_squared_error,
    calculate_mean_absolute_error,
)

pd.set_option('display.max_columns', None)

# Load data set

In [3]:
fouls_df = pd.read_csv('data/foulsDataset.csv', encoding='latin1')

# Clean data

Here we clean the fouls data: missing values are replaced and categorical columns are converted to numeric, since most models require numeric inputs.

In [4]:
fouls_df = clean_fouls_dataset(fouls_df)
fouls_df = categorical_variables_to_numeric(fouls_df)

In [5]:
fouls_df.head(3)

Unnamed: 0,competition_level,team1_goals,team2_goals,distance,sup_implied,tg_implied,team1_shot_on_target,team1_shot_off_target,team1_corners,team1_fouls,team1_offsides,team1_yellow_cards,team1_red_cards,team1_penalties_awarded,team2_shot_on_target,team2_shot_off_target,team2_corners,team2_fouls,team2_offsides,team2_yellow_cards,team2_red_cards,team2_penalties_awarded,year,month,day,minutes,hour,rounded_hour,France,Germany,Italy,Spain,EngPr,FraL1,FraL2,GerBL1,GerBL2,ItaSA,SpaPr,SpaSe,team1_name_enc,team2_name_enc,referee_enc
0,2,0,0,389.3,0.260384,2.267575,3.0,6.0,6.0,12.0,2.0,1.0,0.0,0.0,2.0,7.0,2.0,12.0,2.0,2.0,0.0,0.0,14.0,8.0,1.0,0.0,18.0,18.0,1,0,0,0,0,0,1,0,0,0,0,0,13.325046,13.054772,13.792619
1,2,2,0,317.1,0.008859,2.287182,3.0,4.0,3.0,13.0,2.0,1.0,0.0,1.0,5.0,6.0,6.0,18.0,2.0,2.0,0.0,0.0,14.0,8.0,1.0,0.0,18.0,18.0,1,0,0,0,0,0,1,0,0,0,0,0,13.245871,13.007635,13.219941
2,2,2,0,1016.0,0.360465,2.238419,5.0,6.0,4.0,10.0,3.0,0.0,0.0,0.0,0.0,2.0,1.0,23.0,2.0,0.0,0.0,0.0,14.0,8.0,1.0,0.0,18.0,18.0,1,0,0,0,0,0,1,0,0,0,0,0,13.630486,13.156533,13.784554


# Scale data

Define target variable

In [6]:
# Define target variable
team_dict = {'home':1, 'away':2}
target_variable_name = f"team{team_dict['home']}_fouls"

Standardize and normalize the features in fouls_df, and create a dictionary that holds the result of each scaling technique.

NB: the target variable is not scaled.

In [7]:
# Set target variable and features
target_variable = fouls_df[target_variable_name]
fouls_notarget_df = fouls_df.drop(columns=[target_variable_name])
fouls_notarget_standard_df = standardize_data(fouls_notarget_df)
fouls_notarget_norm_df = normalize_data(fouls_notarget_df)

feature_scaling_dataframe_dict = {}
feature_scaling_dataframe_dict['no_scaling'] = fouls_notarget_df
feature_scaling_dataframe_dict['standardize'] = fouls_notarget_standard_df
feature_scaling_dataframe_dict['normalize'] = fouls_notarget_norm_df
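
The helpers standardize_data and normalize_data live in src.etl.feature_scaling and their code is not shown in this notebook. As a rough sketch of what such helpers commonly do, assuming that standardizing means z-scoring each column and normalizing means min-max scaling to [0, 1] (both are assumptions, as are the function names and toy data below):

```python
import pandas as pd


def standardize_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: z-score each column (mean 0, std 1)."""
    # Replace a zero std with 1 so constant columns do not divide by zero
    std = df.std(ddof=0).replace(0, 1)
    return (df - df.mean()) / std


def normalize_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: min-max scale each column to [0, 1]."""
    # Replace a zero range with 1 so constant columns do not divide by zero
    col_range = (df.max() - df.min()).replace(0, 1)
    return (df - df.min()) / col_range


# Toy data, made up for illustration
toy_df = pd.DataFrame(
    {"distance": [100.0, 200.0, 300.0], "team2_fouls": [10.0, 12.0, 20.0]}
)
standardized = standardize_sketch(toy_df)
normalized = normalize_sketch(toy_df)
```

Note that computing these statistics on the full dataframe before the train/test split lets test rows influence the scaling; fitting them on the training rows only is the stricter option.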

# Compare models

### Define models dict

Specify the models to be used. 

NB: Most of the models used come from scikit-learn or other Python packages; a possible extension is to look at other ways of implementing these models.

In [8]:
# model list
models_dict = {
    'SVR': SVMRegression(), 
    'Poisson': PoissonRegression(), 
    'DT': DecisionTreeRegression(),
    'RF': RandomForestRegression(),
    'XgBoost': XGBRegression(),
}

### Run and evaluate models

In [9]:
table_row_list = []

for scaling_method, scaled_df in feature_scaling_dataframe_dict.items():
    feature_selection_variables_dict = get_feature_selection_variables_dict(scaled_df, target_variable)
    for feature_selection_method, features in feature_selection_variables_dict.items():
        # The split depends only on the selected features, so do it once per feature set
        X = scaled_df[features]
        y = target_variable
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.3, random_state=101
        )
        for model_name, model in models_dict.items():
            model.fit(X_train, y_train)
            y_pred = model.predict(X_test)
            model_data = [
                scaling_method,
                feature_selection_method,
                model_name,
                len(X.columns),
                calculate_mean_squared_error(y_test, y_pred),
                calculate_root_mean_squared_error(y_test, y_pred),
                calculate_mean_absolute_error(y_test, y_pred),
            ]
            table_row_list.append(model_data)

# create evaluation dataframe
evaluation_df = pd.DataFrame(
    data=table_row_list,
    columns=[
        "dataset",
        "feature_selection_method",
        "model",
        "number_of_features",
        "MSE",
        "RMSE",
        "MAE"
    ]
)
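
The three metrics come from src.etl.evaluation_metrics, whose code is not shown here. A minimal sketch of the standard definitions, using hypothetical function names and toy values, and assuming the helpers follow the usual formulas:

```python
import math


def mse_sketch(y_true, y_pred):
    """Mean squared error: average of the squared residuals."""
    errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(errors) / len(errors)


def rmse_sketch(y_true, y_pred):
    """Root mean squared error: square root of the MSE."""
    return math.sqrt(mse_sketch(y_true, y_pred))


def mae_sketch(y_true, y_pred):
    """Mean absolute error: average of the absolute residuals."""
    errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return sum(errors) / len(errors)


# Toy foul counts, made up for illustration
y_true_toy = [12.0, 13.0, 10.0, 15.0]
y_pred_toy = [11.0, 15.0, 10.0, 12.0]
toy_mse = mse_sketch(y_true_toy, y_pred_toy)
toy_rmse = rmse_sketch(y_true_toy, y_pred_toy)
toy_mae = mae_sketch(y_true_toy, y_pred_toy)
```

RMSE and MAE are expressed in the same units as the target (fouls per game), which makes them easier to read than MSE.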


Look at the five best-performing models

In [10]:
evaluation_df.sort_values(by=['RMSE']).reset_index(drop=True).head(5)

Unnamed: 0,dataset,feature_selection_method,model,number_of_features,MSE,RMSE,MAE
0,normalize,forward_selection,SVR,27,11.939879,3.455413,2.723677
1,normalize,backward_elimination,SVR,23,11.96075,3.458432,2.725541
2,standardize,backward_elimination,Poisson,25,12.141609,3.484481,2.76242
3,standardize,forward_selection,Poisson,25,12.15227,3.486011,2.764799
4,standardize,forward_selection,SVR,25,12.212913,3.494698,2.754165


Look at the five best-performing models with ten or fewer features

In [11]:
evaluation_df.query('number_of_features <= 10').sort_values(by=['RMSE']).reset_index(drop=True).head(5)

Unnamed: 0,dataset,feature_selection_method,model,number_of_features,MSE,RMSE,MAE
0,normalize,filter_method,SVR,10,12.432025,3.525908,2.794891
1,standardize,filter_method,SVR,10,12.503008,3.535959,2.801449
2,standardize,filter_method,Poisson,10,12.581873,3.547094,2.818515
3,no_scaling,filter_method,Poisson,10,12.664105,3.558666,2.824048
4,no_scaling,filter_method,SVR,10,12.714124,3.565687,2.819529


From the two tables above, we can see that SVR and Poisson are the two best-performing models, with SVR the best overall. The best-performing models used forward selection or backward elimination for feature selection. When the number of features was limited to ten or fewer, the filter method worked best. It is important to note that the models with fewer features had a slightly higher RMSE (3.526 with 10 features against 3.455 with 27 for the best model in each table), but the difference is not large enough to justify using the many more features (23 to 27) that the top five models rely on.
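
To put a number on that trade-off, the RMSE values of the best model in each of the two tables can be compared directly. This is plain arithmetic on the reported figures; the variable names are only for illustration:

```python
# RMSE values copied from the two result tables
best_overall_rmse = 3.455413  # normalize + forward_selection + SVR, 27 features
best_small_rmse = 3.525908    # normalize + filter_method + SVR, 10 features

absolute_increase = best_small_rmse - best_overall_rmse
relative_increase_pct = 100 * absolute_increase / best_overall_rmse

print(f"RMSE increase: {absolute_increase:.4f} fouls ({relative_increase_pct:.2f}%)")
```

Dropping from 27 to 10 features costs about 0.07 fouls of RMSE, roughly a 2% increase.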