# **Testing ML Algorithms**

## Objectives

* Use accuracy as the metric to identify the 2 algorithms that give the best results

## Inputs

* The data file, "US_Accidents_For_ML.csv", which is saved locally in "Data/ML"

## Outputs

* The 2 best performing algorithms, which will be carried forward for hyperparameter optimisation

## Summary of Steps

* Load the dataset
* Split data into train and test sets
* Test algorithms with data that has not been transformed
* Test algorithms with data that has been transformed
* Test algorithms with "Start_Lat" and "Start_Lng" columns dropped (28 variables)
* Test algorithms with a reduced set of 20 variables (cut from 30)
* Test algorithms with a reduced set of 20 variables (cut from 30) and transformed numerical variables
* Pick the 2 algorithms, and dataset, that give the best accuracy 

---

## Change Working Directory

* We are assuming you will store the notebooks in a subfolder, therefore when running the notebook in the editor, you will need to change the working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [70]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\sonia\\Documents\\VS Studio Projects\\US_Accidents_ML_Project'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [71]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


In [73]:
os.chdir(r"c:\Users\sonia\Documents\VS Studio Projects\US_Accidents_ML_Project")

os.getcwd()

'c:\\Users\\sonia\\Documents\\VS Studio Projects\\US_Accidents_ML_Project'

Confirm the new current directory

In [74]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\sonia\\Documents\\VS Studio Projects\\US_Accidents_ML_Project'
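The hard-coded path above only works on one machine. As a minimal sketch of a more portable approach, the project root can be located by walking up from the current folder until a known marker folder is found. The helper name `find_project_root` and the choice of "Data" as the marker are assumptions made for illustration; they are not part of the original notebook.

```python
import os
import tempfile
from pathlib import Path


def find_project_root(start, marker="Data"):
    """Walk upwards from `start` until a folder containing `marker` is found.

    Hypothetical helper: the name and the marker folder are illustrative.
    """
    start = Path(start).resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains a '{marker}' folder")


# Demonstration in a throwaway tree: <tmp>/project/Data and <tmp>/project/notebooks
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "project"
    (project / "Data").mkdir(parents=True)
    notebooks = project / "notebooks"
    notebooks.mkdir()

    root = find_project_root(notebooks)
    found_root_name = root.name
    print("Project root found:", found_root_name)

# In the notebook itself this could be used as:
# os.chdir(find_project_root(os.getcwd()))
```

This avoids depending on whether the notebook was started from the project folder or from a subfolder.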

---

## Required Libraries

In [75]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import QuantileTransformer
from feature_engine.transformation import YeoJohnsonTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix

In [76]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier 
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import HistGradientBoostingClassifier
from xgboost import XGBClassifier
from mord import LogisticAT
from sklearn.neural_network import MLPClassifier

---

## Load the Dataset

I will use Pandas to load my dataset into a DataFrame.

In [77]:
df = pd.read_csv("Data/ML/US_Accidents_For_ML.csv")
pd.set_option("display.max_columns", None)
df.head()

Unnamed: 0,Severity,Start_Lat,Start_Lng,Distance(mi),Timezone,Temperature(F),Humidity(%),Pressure(in),Visibility(mi),Wind_Direction,Wind_Speed(mph),Precipitation(in),Amenity,Bump,Crossing,Give_Way,Junction,No_Exit,Railway,Station,Stop,Traffic_Calming,Traffic_Signal,Sunrise_Sunset,Clearance_Class,Weather_Simplified,State_Other,Road_Type,Population,County_Other,Month
0,2,32.456486,-93.774536,0.501,Central,78.0,62.0,29.61,10.0,CALM,0.0,0.0,False,False,False,False,False,False,False,False,False,False,False,Night,Very Long,Fair,LA,Avenue,187540,Caddo,Sep
1,2,36.804693,-76.189728,0.253,Eastern,54.0,90.0,30.4,7.0,CALM,0.0,0.0,False,False,True,False,False,False,False,False,False,False,True,Night,Very Long,Fair,VA,Drive,459444,Virginia Beach,May
2,2,29.895741,-90.090026,1.154,Pacific,40.0,58.0,30.28,10.0,N,10.0,0.0,False,False,False,False,True,False,False,False,False,False,False,Day,Very Long,Cloudy,LA,Drive,440784,Jefferson,Jan
3,2,32.456459,-93.779709,0.016,Central,62.0,75.0,29.8,10.0,SSE,8.0,0.0,False,False,False,False,False,False,False,False,False,False,False,Night,Very Long,Cloudy,LA,Avenue,187540,Caddo,Nov
4,2,26.966433,-82.255414,0.057,Eastern,84.0,69.0,29.99,10.0,E,18.0,0.0,False,False,False,False,False,False,False,False,False,False,False,Day,Very Long,Cloudy,FL,Boulevard,186824,Charlotte,Sep


---

## Split into Train and Test

First, I will split the data into train (80 %) and test (20 %) sets. I am also encoding the target, "Clearance_Class", as ordered integers for use with the ordinal logistic regression algorithm, "LogisticAT".

In [78]:
mapping = {'Short': 0, 'Moderate': 1, 'Long': 2, "Very Long": 3}
df['Clearance_Class_num'] = df['Clearance_Class'].map(mapping)

X_train, X_test, y_train, y_test = train_test_split(
    df.drop(['Clearance_Class', "Clearance_Class_num"], axis=1),
    df['Clearance_Class_num'],
    test_size=0.2,
    random_state=0
)
print(
    "* Train set:",
    X_train.shape,
    y_train.shape,
    "\n* Test set:",
    X_test.shape,
    y_test.shape,
)

* Train set: (7974, 30) (7974,) 
* Test set: (1994, 30) (1994,)
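One thing worth checking after mapping the target is that every label was covered, because `Series.map` returns NaN for values missing from the dictionary instead of raising an error. The sketch below uses a small toy frame (not the real dataset) with a deliberately unmapped label, "Unknown", to show the check before splitting.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

mapping = {"Short": 0, "Moderate": 1, "Long": 2, "Very Long": 3}

# Toy frame standing in for the real data; "Unknown" is deliberately unmapped.
labels = ["Short", "Moderate", "Long", "Very Long", "Short", "Moderate",
          "Long", "Very Long", "Short", "Moderate", "Unknown"]
toy = pd.DataFrame({"feature": range(len(labels)), "Clearance_Class": labels})
toy["Clearance_Class_num"] = toy["Clearance_Class"].map(mapping)

# Labels that are missing from the mapping show up as NaN.
unmapped = (
    toy.loc[toy["Clearance_Class_num"].isna(), "Clearance_Class"].unique().tolist()
)
print("Unmapped labels:", unmapped)

# Drop (or fix) unmapped rows before splitting 80/20.
clean = toy.dropna(subset=["Clearance_Class_num"])
X_tr, X_te, y_tr, y_te = train_test_split(
    clean[["feature"]],
    clean["Clearance_Class_num"].astype(int),
    test_size=0.2,
    random_state=0,
)
print("Train rows:", len(X_tr), "Test rows:", len(X_te))
```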


Now I am ready to compare algorithms.

---

## Comparison of Algorithms - No Transformed Variables

Here, I have kept the variables "Start_Lat" and "Start_Lng", which are treated the same as other numerical data and scaled. Categorical variables are encoded with OneHotEncoder, i.e. each category becomes its own binary column and no order between categories is assumed.

Steps to be included in the preprocessor step:

- Encoding categorical variables
- Scaling numerical variables

Steps to be included in the pipeline:
- Preprocessor
- Feature selection
- Model selection 

In [79]:
def PipelineOptimization(model, X):
    numeric_cols = X.select_dtypes(include="number").columns.tolist()
    categorical_cols = X.select_dtypes(include="object").columns.tolist()

    numeric_pipeline = Pipeline([
        ("scaler", StandardScaler())
    ])

    categorical_pipeline = Pipeline([
        ("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=False))
    ])

    preprocessor = ColumnTransformer(transformers=[
        ("numeric", numeric_pipeline, numeric_cols),
        ("categorical", categorical_pipeline, categorical_cols)
    ])

    steps = [("preprocessor", preprocessor)]

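    # Add feature selection if the model exposes coef_ or feature_importances_.
    # Note: estimators typically only gain these attributes once fitted, so for
    # unfitted models this check is usually False and the step is skipped.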
    if hasattr(model, "coef_") or hasattr(model, "feature_importances_"):
        steps.append(("feat_selection", SelectFromModel(model)))

    steps.append(("model", model))

    pipeline = Pipeline(steps)
    return pipeline
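Because ColumnTransformer drops any column that is not listed in one of its transformers by default, it can be useful to check which columns the two dtype selectors actually cover. The sketch below uses a toy frame with column names borrowed from the dataset; whether the same thing happens with the real data depends on the dtypes pandas assigned when reading the CSV, which is an assumption here.

```python
import pandas as pd

toy = pd.DataFrame({
    "Distance(mi)": [0.501, 0.253],
    "Population": [187540, 459444],
    # dtype=object is set explicitly so the example behaves the same across pandas versions
    "Timezone": pd.Series(["Central", "Eastern"], dtype=object),
    "Crossing": [False, True],
})

numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(include="object").columns.tolist()

# Columns matched by neither selector would not reach the model.
uncovered = [c for c in toy.columns if c not in numeric_cols + categorical_cols]
print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
print("uncovered:", uncovered)
```

In this toy example the boolean column is matched by neither selector. If that matters, one option is to add "bool" to one of the `include` arguments.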

These are the algorithms I am testing. They are mostly tree-based and ensemble tree-based, but I also wanted to test a neural network (MLPClassifier) and a logistic regression based algorithm (LogisticAT).

In [80]:
models_search = {
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
    "GradientBoostingClassifier": GradientBoostingClassifier(random_state=0),
    "ExtraTreesClassifier": ExtraTreesClassifier(random_state=0),
    "AdaBoostClassifier": AdaBoostClassifier(random_state=0),
    "HGrBoostingClassifier": HistGradientBoostingClassifier(random_state=0),
    "Xgb": XGBClassifier(random_state=0), 
    "OrdinalLogistic": LogisticAT(),
    "Mlp": MLPClassifier(random_state=0)
}

At this first stage, I am using default hyperparameter settings.

In [81]:
params_search = {
    "DecisionTreeClassifier": {},
    "RandomForestClassifier": {},
    "GradientBoostingClassifier": {},
    "ExtraTreesClassifier": {},
    "AdaBoostClassifier": {},
    "HGrBoostingClassifier": {},
    "Xgb": {},
    "OrdinalLogistic": {},
    "Mlp": {}
}
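An empty dictionary is a valid parameter grid: it expands to a single candidate that uses the estimator's defaults, which is why each search reports "1 candidates". A small sketch with scikit-learn's `ParameterGrid` shows this. The second grid is hypothetical and only illustrates how the candidate count grows; the `model__` prefix assumes the final pipeline step is named "model", as it is in `PipelineOptimization`.

```python
from sklearn.model_selection import ParameterGrid

# An empty grid yields exactly one candidate: the default settings.
default_candidates = list(ParameterGrid({}))
print(default_candidates)

# Hypothetical grid for a later tuning stage (values chosen for illustration only).
example_grid = {
    "model__max_depth": [3, 5],
    "model__n_estimators": [100, 200],
}
n_example_candidates = len(ParameterGrid(example_grid))
print("Candidates in example grid:", n_example_candidates)
```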

This class uses GridSearchCV to run a grid search for each model and its hyperparameters, as specified above, in a loop. It stores the results and returns a ranked summary table of their performance.

In [82]:
class HyperparameterOptimizationSearch:

    def __init__(self, models, params):
        self.models = models
        self.params = params
        self.keys = list(models.keys())
        self.grid_searches = {}

    def fit(self, X, y, cv, n_jobs, verbose=1, scoring=None, refit=False):
        for key in self.keys:
            print(f"\nRunning GridSearchCV for {key} \n")
            model_pipeline = PipelineOptimization(self.models[key], X)
            params = self.params[key]
            gs = GridSearchCV(model_pipeline, params, cv=cv, n_jobs=n_jobs, verbose=verbose, scoring=scoring, refit=refit)
            gs.fit(X, y)
            self.grid_searches[key] = gs

    def score_summary(self, sort_by='mean_score'):
        def row(key, scores, params):
            d = {
                'estimator': key,
                'min_score': min(scores),
                'max_score': max(scores),
                'mean_score': np.mean(scores),
                'std_score': np.std(scores),
            }
            return pd.Series({**params, **d})

        rows = []
        for k in self.grid_searches:
            gs = self.grid_searches[k]
            params = gs.cv_results_['params']
            all_scores = gs.cv_results_['mean_test_score']
            for p, s in zip(params, all_scores):
                rows.append(row(k, [s], p))

        df = pd.concat(rows, axis=1).T.sort_values([sort_by], ascending=False)
        columns = ['estimator', 'min_score', 'mean_score', 'max_score', 'std_score']
        columns += [c for c in df.columns if c not in columns]
        return df[columns], self.grid_searches
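Note that `score_summary` passes a single value (the mean test score) to `row` for each candidate, so min, max and mean are identical and the standard deviation is always 0. If the spread across folds is of interest, the per-fold scores are available in `cv_results_` under the keys `split0_test_score`, `split1_test_score` and so on. The following is a minimal sketch on synthetic data, not a replacement for the class above.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the real training set.
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=4, n_classes=2, random_state=0
)

gs = GridSearchCV(DecisionTreeClassifier(random_state=0), {}, cv=3, scoring="accuracy")
gs.fit(X, y)

# Collect the score of the single candidate (index 0) on each fold.
fold_keys = [f"split{i}_test_score" for i in range(gs.n_splits_)]
fold_scores = [gs.cv_results_[key][0] for key in fold_keys]

summary = pd.Series({
    "min_score": min(fold_scores),
    "mean_score": np.mean(fold_scores),
    "max_score": max(fold_scores),
    "std_score": np.std(fold_scores),
})
print(summary)
```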

Now I train the models, creating an instance of HyperparameterOptimizationSearch, which contains the algorithms and hyperparameters to use.

Cross-validation (cv) is used to evaluate how well the model generalises. It splits the train set into cv=*n* parts (folds), and then runs *n* rounds of training and testing, rotating which fold is held out for testing each time.
Here, I use cv=2, which keeps run time low and is acceptable for larger datasets and for exploratory comparisons like this one. With only two folds the estimate is noisier, so cv=5 is a more common choice when more reliable results are needed.
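The fold rotation described above can be sketched with a tiny synthetic dataset. As far as I am aware, GridSearchCV uses a stratified splitter when an integer cv is given for a classifier, so `StratifiedKFold` is used here; the data and the choice of 3 folds are for illustration only.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# 12 toy samples with two balanced classes.
X = np.arange(12).reshape(-1, 1)
y = np.array([0, 1] * 6)

cv = StratifiedKFold(n_splits=3)
test_folds = []
for round_number, (train_idx, test_idx) in enumerate(cv.split(X, y), start=1):
    test_folds.append(test_idx)
    print(f"Round {round_number}: train on {len(train_idx)} rows, test on rows {test_idx.tolist()}")

# Every sample is used for testing exactly once across the rounds.
all_test = np.sort(np.concatenate(test_folds))
```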

In [83]:
search = HyperparameterOptimizationSearch(models=models_search, params=params_search)
search.fit(
    X_train, y_train,
    scoring='accuracy',
    n_jobs=-1,
    cv=2
)


Running GridSearchCV for DecisionTreeClassifier 

Fitting 2 folds for each of 1 candidates, totalling 2 fits

Running GridSearchCV for RandomForestClassifier 

Fitting 2 folds for each of 1 candidates, totalling 2 fits

Running GridSearchCV for GradientBoostingClassifier 

Fitting 2 folds for each of 1 candidates, totalling 2 fits

Running GridSearchCV for ExtraTreesClassifier 

Fitting 2 folds for each of 1 candidates, totalling 2 fits

Running GridSearchCV for AdaBoostClassifier 

Fitting 2 folds for each of 1 candidates, totalling 2 fits

Running GridSearchCV for HGrBoostingClassifier 

Fitting 2 folds for each of 1 candidates, totalling 2 fits

Running GridSearchCV for Xgb 

Fitting 2 folds for each of 1 candidates, totalling 2 fits

Running GridSearchCV for OrdinalLogistic 

Fitting 2 folds for each of 1 candidates, totalling 2 fits

Running GridSearchCV for Mlp 

Fitting 2 folds for each of 1 candidates, totalling 2 fits


In [84]:
grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary

| | estimator | min_score | mean_score | max_score | std_score |
|---|---|---|---|---|---|
| 2 | GradientBoostingClassifier | 0.599448 | 0.599448 | 0.599448 | 0.0 |
| 1 | RandomForestClassifier | 0.59832 | 0.59832 | 0.59832 | 0.0 |
| 6 | Xgb | 0.590043 | 0.590043 | 0.590043 | 0.0 |
| 5 | HGrBoostingClassifier | 0.589917 | 0.589917 | 0.589917 | 0.0 |
| 4 | AdaBoostClassifier | 0.547404 | 0.547404 | 0.547404 | 0.0 |
| 3 | ExtraTreesClassifier | 0.538124 | 0.538124 | 0.538124 | 0.0 |
| 0 | DecisionTreeClassifier | 0.502508 | 0.502508 | 0.502508 | 0.0 |
| 8 | Mlp | 0.490971 | 0.490971 | 0.490971 | 0.0 |
| 7 | OrdinalLogistic | 0.450088 | 0.450088 | 0.450088 | 0.0 |


The best performing models are Gradient Boosting Classifier (accuracy = 59.9 %) followed very closely by Random Forest Classifier (accuracy = 59.8 %).

---

## Comparison of Algorithms - Transformed Variables

Above, the Mlp and OrdinalLogistic models performed worst. These algorithms are more sensitive to skewed data, whereas tree-based algorithms generally are not. I will now transform the numeric variables as optimised in the previous notebook, "Number_Feature_Transformation", and train the models again to see whether this improves accuracy.

I will not be transforming "Start_Lat" and "Start_Lng", but will be scaling them as done above.

Steps to be included in the preprocessor step:

- Transformation of numeric variables (except "Start_Lat" and "Start_Lng")
- Scaling of numeric variables
- Encoding categoric variables

Steps to be included in the pipeline:
- Preprocessor
- Feature selection
- Model selection 
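To illustrate what the transformation step is meant to achieve, the sketch below applies a Yeo-Johnson and a quantile transformation to a synthetic right-skewed variable and compares the skewness before and after. Scikit-learn's `PowerTransformer` is used here as a stand-in for feature_engine's `YeoJohnsonTransformer`, and the log-normal data is invented for illustration; neither is taken from the notebook.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer, QuantileTransformer

# Synthetic right-skewed variable (500 rows, 1 column).
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=(500, 1))

# Yeo-Johnson power transformation (scikit-learn stand-in).
yeojohnson = PowerTransformer(method="yeo-johnson", standardize=True)
x_yeojohnson = yeojohnson.fit_transform(x)

# Quantile transformation to a normal distribution.
quantile = QuantileTransformer(
    n_quantiles=100, output_distribution="normal", random_state=0
)
x_quantile = quantile.fit_transform(x)

skew_before = skew(x.ravel())
skew_yeojohnson = skew(x_yeojohnson.ravel())
skew_quantile = skew(x_quantile.ravel())
print(f"Skew before:            {skew_before:.2f}")
print(f"Skew after Yeo-Johnson: {skew_yeojohnson:.2f}")
print(f"Skew after quantile:    {skew_quantile:.2f}")
```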

In [85]:
def PipelineOptimization(model, X):
    numeric_yeojohnson = ["Distance(mi)", "Humidity(%)", "Population", "Pressure(in)", "Wind_Speed(mph)"]
    numeric_quantile = ["Temperature(F)", "Visibility(mi)", "Precipitation(in)"]
    numeric_coords = ["Start_Lat", "Start_Lng"]
    categorical_cols = X.select_dtypes(include="object").columns.tolist()

    yeojohnson_pipeline = Pipeline([
        ("yeojohnson", YeoJohnsonTransformer(variables=numeric_yeojohnson)),
        ("scaler", StandardScaler())
    ])

    quantile_pipeline = Pipeline([
        ("quantile", QuantileTransformer(output_distribution="normal")),
        ("scaler", StandardScaler())
    ])

    coords_pipeline = Pipeline([
        ("scaler", StandardScaler())
    ])

    categorical_pipeline = Pipeline([
        ("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=False))
    ])

    preprocessor = ColumnTransformer(transformers=[
        ("yeojohnson", yeojohnson_pipeline, numeric_yeojohnson),
        ("quantile", quantile_pipeline, numeric_quantile),
        ("coords", coords_pipeline, numeric_coords),
        ("categorical", categorical_pipeline, categorical_cols)
    ])

    steps = [("preprocessor", preprocessor)]

    # Add feature selection if the model supports it
    if hasattr(model, "coef_") or hasattr(model, "feature_importances_"):
        steps.append(("feat_selection", SelectFromModel(model)))

    steps.append(("model", model))

    pipeline = Pipeline(steps)
    return pipeline
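The `hasattr` check in the pipeline builder runs on the model before it has been fitted. A quick way to see what that check evaluates to is shown below, using a decision tree on synthetic data as an example. This is only a sketch of the behaviour, not a change to the pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(random_state=0)

# Before fitting, the attribute cannot be read, so hasattr reports False.
before_fit = hasattr(model, "feature_importances_")

X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model.fit(X, y)

# After fitting, the attribute is available.
after_fit = hasattr(model, "feature_importances_")
print("Before fit:", before_fit, "| After fit:", after_fit)
```

If the feature selection step is wanted regardless, one option is to decide per model class (for example with an explicit list of tree-based estimators) instead of inspecting attributes that are only set during fitting.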

In [86]:
models_search = {
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
    "GradientBoostingClassifier": GradientBoostingClassifier(random_state=0),
    "ExtraTreesClassifier": ExtraTreesClassifier(random_state=0),
    "AdaBoostClassifier": AdaBoostClassifier(random_state=0),
    "HGrBoostingClassifier": HistGradientBoostingClassifier(random_state=0),
    "Xgb": XGBClassifier(random_state=0), 
    "OrdinalLogistic": LogisticAT(),
    "Mlp": MLPClassifier(random_state=0)
}

In [87]:
params_search = {
    "DecisionTreeClassifier": {},
    "RandomForestClassifier": {},
    "GradientBoostingClassifier": {},
    "ExtraTreesClassifier": {},
    "AdaBoostClassifier": {},
    "HGrBoostingClassifier": {},
    "Xgb": {},
    "OrdinalLogistic": {},
    "Mlp": {}
}

In [88]:
class HyperparameterOptimizationSearch:

    def __init__(self, models, params):
        self.models = models
        self.params = params
        self.keys = list(models.keys())
        self.grid_searches = {}

    def fit(self, X, y, cv, n_jobs, verbose=1, scoring=None, refit=False):
        for key in self.keys:
            print(f"\nRunning GridSearchCV for {key} \n")
            model_pipeline = PipelineOptimization(self.models[key], X)
            params = self.params[key]
            gs = GridSearchCV(model_pipeline, params, cv=cv, n_jobs=n_jobs, verbose=verbose, scoring=scoring, refit=refit)
            gs.fit(X, y)
            self.grid_searches[key] = gs

    def score_summary(self, sort_by='mean_score'):
        def row(key, scores, params):
            d = {
                'estimator': key,
                'min_score': min(scores),
                'max_score': max(scores),
                'mean_score': np.mean(scores),
                'std_score': np.std(scores),
            }
            return pd.Series({**params, **d})

        rows = []
        for k in self.grid_searches:
            gs = self.grid_searches[k]
            params = gs.cv_results_['params']
            all_scores = gs.cv_results_['mean_test_score']
            for p, s in zip(params, all_scores):
                rows.append(row(k, [s], p))

        df = pd.concat(rows, axis=1).T.sort_values([sort_by], ascending=False)
        columns = ['estimator', 'min_score', 'mean_score', 'max_score', 'std_score']
        columns += [c for c in df.columns if c not in columns]
        return df[columns], self.grid_searches

In [89]:
search = HyperparameterOptimizationSearch(models=models_search, params=params_search)
search.fit(
    X_train, y_train,
    scoring='accuracy',
    n_jobs=-1,
    cv=2
)


Running GridSearchCV for DecisionTreeClassifier 

Fitting 2 folds for each of 1 candidates, totalling 2 fits

Running GridSearchCV for RandomForestClassifier 

Fitting 2 folds for each of 1 candidates, totalling 2 fits

Running GridSearchCV for GradientBoostingClassifier 

Fitting 2 folds for each of 1 candidates, totalling 2 fits

Running GridSearchCV for ExtraTreesClassifier 

Fitting 2 folds for each of 1 candidates, totalling 2 fits

Running GridSearchCV for AdaBoostClassifier 

Fitting 2 folds for each of 1 candidates, totalling 2 fits

Running GridSearchCV for HGrBoostingClassifier 

Fitting 2 folds for each of 1 candidates, totalling 2 fits

Running GridSearchCV for Xgb 

Fitting 2 folds for each of 1 candidates, totalling 2 fits

Running GridSearchCV for OrdinalLogistic 

Fitting 2 folds for each of 1 candidates, totalling 2 fits

Running GridSearchCV for Mlp 

Fitting 2 folds for each of 1 candidates, totalling 2 fits


In [90]:
grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary

| | estimator | min_score | mean_score | max_score | std_score |
|---|---|---|---|---|---|
| 1 | RandomForestClassifier | 0.591924 | 0.591924 | 0.591924 | 0.0 |
| 2 | GradientBoostingClassifier | 0.589666 | 0.589666 | 0.589666 | 0.0 |
| 5 | HGrBoostingClassifier | 0.582894 | 0.582894 | 0.582894 | 0.0 |
| 6 | Xgb | 0.574241 | 0.574241 | 0.574241 | 0.0 |
| 3 | ExtraTreesClassifier | 0.548533 | 0.548533 | 0.548533 | 0.0 |
| 4 | AdaBoostClassifier | 0.544018 | 0.544018 | 0.544018 | 0.0 |
| 8 | Mlp | 0.496739 | 0.496739 | 0.496739 | 0.0 |
| 0 | DecisionTreeClassifier | 0.495987 | 0.495987 | 0.495987 | 0.0 |
| 7 | OrdinalLogistic | 0.482443 | 0.482443 | 0.482443 | 0.0 |


The best performing models with transformed data are Random Forest Classifier (accuracy = 59.2 %) followed by Gradient Boosting Classifier (accuracy = 59.0 %).

Compared with the non-transformed data above, the transformed dataset performed slightly worse for most models, including the four best. The exceptions were OrdinalLogistic, whose accuracy went from 45.0 % to 48.2 %, ExtraTreesClassifier (53.8 % to 54.9 %) and Mlp (49.1 % to 49.7 %).
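The two summary tables can be compared model by model with a few lines of pandas. The scores below are copied from the two tables above; the variable names are illustrative.

```python
import pandas as pd

# Mean accuracy per model, copied from the two summary tables.
untransformed = {
    "DecisionTreeClassifier": 0.502508,
    "RandomForestClassifier": 0.59832,
    "GradientBoostingClassifier": 0.599448,
    "ExtraTreesClassifier": 0.538124,
    "AdaBoostClassifier": 0.547404,
    "HGrBoostingClassifier": 0.589917,
    "Xgb": 0.590043,
    "OrdinalLogistic": 0.450088,
    "Mlp": 0.490971,
}
transformed = {
    "DecisionTreeClassifier": 0.495987,
    "RandomForestClassifier": 0.591924,
    "GradientBoostingClassifier": 0.589666,
    "ExtraTreesClassifier": 0.548533,
    "AdaBoostClassifier": 0.544018,
    "HGrBoostingClassifier": 0.582894,
    "Xgb": 0.574241,
    "OrdinalLogistic": 0.482443,
    "Mlp": 0.496739,
}

comparison = pd.DataFrame({"untransformed": untransformed, "transformed": transformed})
comparison["delta"] = comparison["transformed"] - comparison["untransformed"]

# Models whose accuracy increased after transformation.
improved = sorted(comparison.index[comparison["delta"] > 0])
print(comparison.sort_values("delta", ascending=False).round(4))
print("Improved:", improved)
```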

---

## Effect of Latitude and Longitude data

I am interested in understanding whether the algorithms will perform better without "Start_Lat" and "Start_Lng". I am going to repeat the training without these variables, using non-transformed data.
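When only a couple of columns are being removed, an alternative to listing every column to keep is to name the columns to drop. The sketch below shows this on a one-row toy frame with values borrowed from the first row of the dataset; it is an equivalent shorthand under that assumption, not the approach used in the next cell.

```python
import pandas as pd

toy = pd.DataFrame({
    "Severity": [2],
    "Start_Lat": [32.456486],
    "Start_Lng": [-93.774536],
    "Clearance_Class": ["Very Long"],
})

# Drop the coordinate columns and keep everything else in its original order.
df_without_coords = toy.drop(columns=["Start_Lat", "Start_Lng"]).copy()
remaining_columns = df_without_coords.columns.tolist()
print(remaining_columns)
```

By default `drop` raises a KeyError if a named column does not exist, which helps catch typos in column names.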

In [101]:
keep_col = [
    "Severity",
    "Distance(mi)",
    "Timezone",
    "Temperature(F)",
    "Humidity(%)",
    "Pressure(in)",
    "Visibility(mi)",
    "Wind_Direction",
    "Wind_Speed(mph)",
    "Precipitation(in)",
    "Amenity",
    "Bump",
    "Crossing",
    "Give_Way",
    "Junction",
    "No_Exit",
    "Railway",
    "Station",
    "Stop",
    "Traffic_Calming",
    "Traffic_Signal",
    "Sunrise_Sunset",
    "Clearance_Class",
    "Weather_Simplified",
    "State_Other",
    "Road_Type",
    "Population",
    "County_Other",
    "Month"
]

df_keep = df[keep_col].copy()
df_keep.head()

Unnamed: 0,Severity,Distance(mi),Timezone,Temperature(F),Humidity(%),Pressure(in),Visibility(mi),Wind_Direction,Wind_Speed(mph),Precipitation(in),Amenity,Bump,Crossing,Give_Way,Junction,No_Exit,Railway,Station,Stop,Traffic_Calming,Traffic_Signal,Sunrise_Sunset,Clearance_Class,Weather_Simplified,State_Other,Road_Type,Population,County_Other,Month
0,2,0.501,Central,78.0,62.0,29.61,10.0,CALM,0.0,0.0,False,False,False,False,False,False,False,False,False,False,False,Night,Very Long,Fair,LA,Avenue,187540,Caddo,Sep
1,2,0.253,Eastern,54.0,90.0,30.4,7.0,CALM,0.0,0.0,False,False,True,False,False,False,False,False,False,False,True,Night,Very Long,Fair,VA,Drive,459444,Virginia Beach,May
2,2,1.154,Pacific,40.0,58.0,30.28,10.0,N,10.0,0.0,False,False,False,False,True,False,False,False,False,False,False,Day,Very Long,Cloudy,LA,Drive,440784,Jefferson,Jan
3,2,0.016,Central,62.0,75.0,29.8,10.0,SSE,8.0,0.0,False,False,False,False,False,False,False,False,False,False,False,Night,Very Long,Cloudy,LA,Avenue,187540,Caddo,Nov
4,2,0.057,Eastern,84.0,69.0,29.99,10.0,E,18.0,0.0,False,False,False,False,False,False,False,False,False,False,False,Day,Very Long,Cloudy,FL,Boulevard,186824,Charlotte,Sep


In [102]:
mapping = {'Short': 0, 'Moderate': 1, 'Long': 2, "Very Long": 3}
df_keep['Clearance_Class_num'] = df_keep['Clearance_Class'].map(mapping)

X_train, X_test, y_train, y_test = train_test_split(
    df_keep.drop(['Clearance_Class', "Clearance_Class_num"], axis=1),
    df_keep['Clearance_Class_num'],
    test_size=0.2,
    random_state=0
)
print(
    "* Train set:",
    X_train.shape,
    y_train.shape,
    "\n* Test set:",
    X_test.shape,
    y_test.shape,
)

* Train set: (7974, 28) (7974,) 
* Test set: (1994, 28) (1994,)


In [103]:
def PipelineOptimization(model, X):
    numeric_cols = X.select_dtypes(include="number").columns.tolist()
    categorical_cols = X.select_dtypes(include="object").columns.tolist()

    numeric_pipeline = Pipeline([
        ("scaler", StandardScaler())
    ])

    categorical_pipeline = Pipeline([
        ("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=False))
    ])

    preprocessor = ColumnTransformer(transformers=[
        ("numeric", numeric_pipeline, numeric_cols),
        ("categorical", categorical_pipeline, categorical_cols)
    ])

    steps = [("preprocessor", preprocessor)]

    # Add feature selection if the model supports it
    if hasattr(model, "coef_") or hasattr(model, "feature_importances_"):
        steps.append(("feat_selection", SelectFromModel(model)))

    steps.append(("model", model))

    pipeline = Pipeline(steps)
    return pipeline

In [104]:
models_search = {
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
    "GradientBoostingClassifier": GradientBoostingClassifier(random_state=0),
    "ExtraTreesClassifier": ExtraTreesClassifier(random_state=0),
    "AdaBoostClassifier": AdaBoostClassifier(random_state=0),
    "HGrBoostingClassifier": HistGradientBoostingClassifier(random_state=0),
    "Xgb": XGBClassifier(random_state=0), 
    "OrdinalLogistic": LogisticAT(),
    "Mlp": MLPClassifier(random_state=0)
}

In [105]:
params_search = {
    "DecisionTreeClassifier": {},
    "RandomForestClassifier": {},
    "GradientBoostingClassifier": {},
    "ExtraTreesClassifier": {},
    "AdaBoostClassifier": {},
    "HGrBoostingClassifier": {},
    "Xgb": {},
    "OrdinalLogistic": {},
    "Mlp": {}
}

In [106]:
class HyperparameterOptimizationSearch:

    def __init__(self, models, params):
        self.models = models
        self.params = params
        self.keys = list(models.keys())
        self.grid_searches = {}

    def fit(self, X, y, cv, n_jobs, verbose=1, scoring=None, refit=False):
        for key in self.keys:
            print(f"\nRunning GridSearchCV for {key} \n")
            model_pipeline = PipelineOptimization(self.models[key], X)
            params = self.params[key]
            gs = GridSearchCV(model_pipeline, params, cv=cv, n_jobs=n_jobs, verbose=verbose, scoring=scoring, refit=refit)
            gs.fit(X, y)
            self.grid_searches[key] = gs

    def score_summary(self, sort_by='mean_score'):
        def row(key, scores, params):
            d = {
                'estimator': key,
                'min_score': min(scores),
                'max_score': max(scores),
                'mean_score': np.mean(scores),
                'std_score': np.std(scores),
            }
            return pd.Series({**params, **d})

        rows = []
        for k in self.grid_searches:
            gs = self.grid_searches[k]
            params = gs.cv_results_['params']
            all_scores = gs.cv_results_['mean_test_score']
            for p, s in zip(params, all_scores):
                rows.append(row(k, [s], p))

        df = pd.concat(rows, axis=1).T.sort_values([sort_by], ascending=False)
        columns = ['estimator', 'min_score', 'mean_score', 'max_score', 'std_score']
        columns += [c for c in df.columns if c not in columns]
        return df[columns], self.grid_searches

In [107]:
search = HyperparameterOptimizationSearch(models=models_search, params=params_search)
search.fit(
    X_train, y_train,
    scoring='accuracy',
    n_jobs=-1,
    cv=2
)


Running GridSearchCV for DecisionTreeClassifier 

Fitting 2 folds for each of 1 candidates, totalling 2 fits

Running GridSearchCV for RandomForestClassifier 

Fitting 2 folds for each of 1 candidates, totalling 2 fits

Running GridSearchCV for GradientBoostingClassifier 

Fitting 2 folds for each of 1 candidates, totalling 2 fits

Running GridSearchCV for ExtraTreesClassifier 

Fitting 2 folds for each of 1 candidates, totalling 2 fits

Running GridSearchCV for AdaBoostClassifier 

Fitting 2 folds for each of 1 candidates, totalling 2 fits

Running GridSearchCV for HGrBoostingClassifier 

Fitting 2 folds for each of 1 candidates, totalling 2 fits

Running GridSearchCV for Xgb 

Fitting 2 folds for each of 1 candidates, totalling 2 fits

Running GridSearchCV for OrdinalLogistic 

Fitting 2 folds for each of 1 candidates, totalling 2 fits

Running GridSearchCV for Mlp 

Fitting 2 folds for each of 1 candidates, totalling 2 fits


In [108]:
grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary

| | estimator | min_score | mean_score | max_score | std_score |
|---|---|---|---|---|---|
| 1 | RandomForestClassifier | 0.596689 | 0.596689 | 0.596689 | 0.0 |
| 2 | GradientBoostingClassifier | 0.596689 | 0.596689 | 0.596689 | 0.0 |
| 5 | HGrBoostingClassifier | 0.586782 | 0.586782 | 0.586782 | 0.0 |
| 6 | Xgb | 0.584525 | 0.584525 | 0.584525 | 0.0 |
| 4 | AdaBoostClassifier | 0.552044 | 0.552044 | 0.552044 | 0.0 |
| 3 | ExtraTreesClassifier | 0.531101 | 0.531101 | 0.531101 | 0.0 |
| 0 | DecisionTreeClassifier | 0.505016 | 0.505016 | 0.505016 | 0.0 |
| 8 | Mlp | 0.485954 | 0.485954 | 0.485954 | 0.0 |
| 7 | OrdinalLogistic | 0.448583 | 0.448583 | 0.448583 | 0.0 |


The best performing models are Random Forest Classifier and Gradient Boosting Classifier, which are tied (accuracy = 59.7 %).

This version of the non-transformed dataset, without the variables "Start_Lat" and "Start_Lng", has performed almost as well as the version with them (the best score is about 0.3 percentage points lower), while also saving on processing power.

---

## Effect of Dropping Non-Significant Variables 

As dropping "Start_Lat" and "Start_Lng" cost almost no accuracy while reducing the amount of data to process, I will try dropping more variables before training the models again. I am cutting the boolean variables "Amenity", "Bump", "Crossing", "Give_Way", "Junction", "No_Exit", "Railway" and "Traffic_Calming", which did not show any statistically significant differences between clearance classes or clearance time.

In [117]:
keep_col = [
    "Severity",
    "Distance(mi)",
    "Start_Lat",
    "Start_Lng",
    "Timezone",
    "Temperature(F)",
    "Humidity(%)",
    "Pressure(in)",
    "Visibility(mi)",
    "Wind_Direction",
    "Wind_Speed(mph)",
    "Precipitation(in)",
    "Station",
    "Stop",
    "Traffic_Signal",
    "Sunrise_Sunset",
    "Clearance_Class",
    "Weather_Simplified",
    "State_Other",
    "Road_Type",
    "Population",
    "County_Other",
    "Month"
]

df_keep = df[keep_col].copy()
df_keep.head()

Unnamed: 0,Severity,Distance(mi),Start_Lat,Start_Lng,Timezone,Temperature(F),Humidity(%),Pressure(in),Visibility(mi),Wind_Direction,Wind_Speed(mph),Precipitation(in),Station,Stop,Traffic_Signal,Sunrise_Sunset,Clearance_Class,Weather_Simplified,State_Other,Road_Type,Population,County_Other,Month
0,2,0.501,32.456486,-93.774536,Central,78.0,62.0,29.61,10.0,CALM,0.0,0.0,False,False,False,Night,Very Long,Fair,LA,Avenue,187540,Caddo,Sep
1,2,0.253,36.804693,-76.189728,Eastern,54.0,90.0,30.4,7.0,CALM,0.0,0.0,False,False,True,Night,Very Long,Fair,VA,Drive,459444,Virginia Beach,May
2,2,1.154,29.895741,-90.090026,Pacific,40.0,58.0,30.28,10.0,N,10.0,0.0,False,False,False,Day,Very Long,Cloudy,LA,Drive,440784,Jefferson,Jan
3,2,0.016,32.456459,-93.779709,Central,62.0,75.0,29.8,10.0,SSE,8.0,0.0,False,False,False,Night,Very Long,Cloudy,LA,Avenue,187540,Caddo,Nov
4,2,0.057,26.966433,-82.255414,Eastern,84.0,69.0,29.99,10.0,E,18.0,0.0,False,False,False,Day,Very Long,Cloudy,FL,Boulevard,186824,Charlotte,Sep


In [118]:
mapping = {'Short': 0, 'Moderate': 1, 'Long': 2, "Very Long": 3}
df_keep['Clearance_Class_num'] = df_keep['Clearance_Class'].map(mapping)

X_train, X_test, y_train, y_test = train_test_split(
    df_keep.drop(['Clearance_Class', "Clearance_Class_num"], axis=1),
    df_keep['Clearance_Class_num'],
    test_size=0.2,
    random_state=0
)
print(
    "* Train set:",
    X_train.shape,
    y_train.shape,
    "\n* Test set:",
    X_test.shape,
    y_test.shape,
)

* Train set: (7974, 22) (7974,) 
* Test set: (1994, 22) (1994,)


In [119]:
def PipelineOptimization(model, X):
    numeric_cols = X.select_dtypes(include="number").columns.tolist()
    categorical_cols = X.select_dtypes(include="object").columns.tolist()

    numeric_pipeline = Pipeline([
        ("scaler", StandardScaler())
    ])

    categorical_pipeline = Pipeline([
        ("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=False))
    ])

    preprocessor = ColumnTransformer(transformers=[
        ("numeric", numeric_pipeline, numeric_cols),
        ("categorical", categorical_pipeline, categorical_cols)
    ])

    steps = [("preprocessor", preprocessor)]

    # Add feature selection if the model supports it
    if hasattr(model, "coef_") or hasattr(model, "feature_importances_"):
        steps.append(("feat_selection", SelectFromModel(model)))

    steps.append(("model", model))

    pipeline = Pipeline(steps)
    return pipeline

In [120]:
models_search = {
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
    "GradientBoostingClassifier": GradientBoostingClassifier(random_state=0),
    "ExtraTreesClassifier": ExtraTreesClassifier(random_state=0),
    "AdaBoostClassifier": AdaBoostClassifier(random_state=0),
    "HGrBoostingClassifier": HistGradientBoostingClassifier(random_state=0),
    "Xgb": XGBClassifier(random_state=0), 
    "OrdinalLogistic": LogisticAT(),
    "Mlp": MLPClassifier(random_state=0)
}

In [121]:
params_search = {
    "DecisionTreeClassifier": {},
    "RandomForestClassifier": {},
    "GradientBoostingClassifier": {},
    "ExtraTreesClassifier": {},
    "AdaBoostClassifier": {},
    "HGrBoostingClassifier": {},
    "Xgb": {},
    "OrdinalLogistic": {},
    "Mlp": {}
}

In [122]:
class HyperparameterOptimizationSearch:

    def __init__(self, models, params):
        self.models = models
        self.params = params
        self.keys = list(models.keys())
        self.grid_searches = {}

    def fit(self, X, y, cv, n_jobs, verbose=1, scoring=None, refit=False):
        for key in self.keys:
            print(f"\nRunning GridSearchCV for {key} \n")
            model_pipeline = PipelineOptimization(self.models[key], X)
            params = self.params[key]
            gs = GridSearchCV(model_pipeline, params, cv=cv, n_jobs=n_jobs, verbose=verbose, scoring=scoring, refit=refit)
            gs.fit(X, y)
            self.grid_searches[key] = gs

    def score_summary(self, sort_by='mean_score'):
        def row(key, scores, params):
            d = {
                'estimator': key,
                'min_score': min(scores),
                'max_score': max(scores),
                'mean_score': np.mean(scores),
                'std_score': np.std(scores),
            }
            return pd.Series({**params, **d})

        rows = []
        for k in self.grid_searches:
            gs = self.grid_searches[k]
            params = gs.cv_results_['params']
            all_scores = gs.cv_results_['mean_test_score']
            for p, s in zip(params, all_scores):
                rows.append(row(k, [s], p))

        df = pd.concat(rows, axis=1).T.sort_values([sort_by], ascending=False)
        columns = ['estimator', 'min_score', 'mean_score', 'max_score', 'std_score']
        columns += [c for c in df.columns if c not in columns]
        return df[columns], self.grid_searches
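
Note that `score_summary` passes a single value (`mean_test_score`) to `row` for each candidate, so `min_score`, `mean_score` and `max_score` are identical and `std_score` is always 0. If the spread across folds is of interest, it can be read from the per-fold entries of `cv_results_`. The sketch below is one possible way to do this on synthetic data; the helper name `fold_score_summary` is hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier


def fold_score_summary(gs):
    """Summarise per-fold test scores for each candidate of a fitted GridSearchCV."""
    # cv_results_ stores one array per fold under "split<i>_test_score"
    fold_scores = np.column_stack(
        [gs.cv_results_[f"split{i}_test_score"] for i in range(gs.n_splits_)]
    )
    return pd.DataFrame({
        "min_score": fold_scores.min(axis=1),
        "mean_score": fold_scores.mean(axis=1),
        "max_score": fold_scores.max(axis=1),
        "std_score": fold_scores.std(axis=1),
    })


# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

gs_demo = GridSearchCV(
    DecisionTreeClassifier(random_state=0), {}, cv=2, scoring="accuracy", refit=False
)
gs_demo.fit(X_demo, y_demo)

summary_demo = fold_score_summary(gs_demo)
print(summary_demo)
```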

In [123]:
search = HyperparameterOptimizationSearch(models=models_search, params=params_search)
search.fit(
    X_train, y_train,
    scoring='accuracy',
    n_jobs=-1,
    cv=2
)


Running GridSearchCV for DecisionTreeClassifier 

Fitting 2 folds for each of 1 candidates, totalling 2 fits

Running GridSearchCV for RandomForestClassifier 

Fitting 2 folds for each of 1 candidates, totalling 2 fits

Running GridSearchCV for GradientBoostingClassifier 

Fitting 2 folds for each of 1 candidates, totalling 2 fits

Running GridSearchCV for ExtraTreesClassifier 

Fitting 2 folds for each of 1 candidates, totalling 2 fits

Running GridSearchCV for AdaBoostClassifier 

Fitting 2 folds for each of 1 candidates, totalling 2 fits

Running GridSearchCV for HGrBoostingClassifier 

Fitting 2 folds for each of 1 candidates, totalling 2 fits

Running GridSearchCV for Xgb 

Fitting 2 folds for each of 1 candidates, totalling 2 fits

Running GridSearchCV for OrdinalLogistic 

Fitting 2 folds for each of 1 candidates, totalling 2 fits

Running GridSearchCV for Mlp 

Fitting 2 folds for each of 1 candidates, totalling 2 fits


In [124]:
grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary

| | estimator | min_score | mean_score | max_score | std_score |
|---|---|---|---|---|---|
| 2 | GradientBoostingClassifier | 0.600828 | 0.600828 | 0.600828 | 0.0 |
| 1 | RandomForestClassifier | 0.598947 | 0.598947 | 0.598947 | 0.0 |
| 6 | Xgb | 0.590168 | 0.590168 | 0.590168 | 0.0 |
| 5 | HGrBoostingClassifier | 0.589917 | 0.589917 | 0.589917 | 0.0 |
| 4 | AdaBoostClassifier | 0.547404 | 0.547404 | 0.547404 | 0.0 |
| 3 | ExtraTreesClassifier | 0.533609 | 0.533609 | 0.533609 | 0.0 |
| 0 | DecisionTreeClassifier | 0.507399 | 0.507399 | 0.507399 | 0.0 |
| 8 | Mlp | 0.49398 | 0.49398 | 0.49398 | 0.0 |
| 7 | OrdinalLogistic | 0.450088 | 0.450088 | 0.450088 | 0.0 |


Removing the variables that did not have a significant impact on clearance times/classes, while keeping "Start_Lat" and "Start_Lng", has produced the best performance so far: Gradient Boosting Classifier (60.1 %) and Random Forest Classifier (59.9 %).

---

## Reduced Dataset with Transformation

For completeness, I will also test the reduced dataset with the numerical variables transformed.

In [125]:
def PipelineOptimization(model, X):
    numeric_yeojohnson = ["Distance(mi)", "Humidity(%)", "Population", "Pressure(in)", "Wind_Speed(mph)"]
    numeric_quantile = ["Temperature(F)", "Visibility(mi)", "Precipitation(in)"]
    categorical_cols = X.select_dtypes(include="object").columns.tolist()

    yeojohnson_pipeline = Pipeline([
        ("yeojohnson", YeoJohnsonTransformer(variables=numeric_yeojohnson)),
        ("scaler", StandardScaler())
    ])

    quantile_pipeline = Pipeline([
        ("quantile", QuantileTransformer(output_distribution="normal")),
        ("scaler", StandardScaler())
    ])

    categorical_pipeline = Pipeline([
        ("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=False))
    ])

    preprocessor = ColumnTransformer(transformers=[
        ("yeojohnson", yeojohnson_pipeline, numeric_yeojohnson),
        ("quantile", quantile_pipeline, numeric_quantile),
        ("categorical", categorical_pipeline, categorical_cols)
    ])

    steps = [("preprocessor", preprocessor)]

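    # Add feature selection only if the estimator already exposes coef_ or
    # feature_importances_. These are normally set during fit, so for an
    # unfitted estimator this check is expected to be False and the step is skipped.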
    if hasattr(model, "coef_") or hasattr(model, "feature_importances_"):
        steps.append(("feat_selection", SelectFromModel(model)))

    steps.append(("model", model))

    pipeline = Pipeline(steps)
    return pipeline
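
One detail worth checking in the pipeline above: `ColumnTransformer` uses `remainder="drop"` by default, so any column of `X` that is not listed in `numeric_yeojohnson`, `numeric_quantile` or `categorical_cols` is dropped before the model sees it. Whether this affects the results here depends on the columns present in `X_train`, which this sketch does not assume. The toy DataFrame and its column names are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: "a" is listed in the transformer, "b" is not
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

# Default remainder="drop": unlisted columns are removed from the output
ct_drop = ColumnTransformer([("scale", StandardScaler(), ["a"])])
out_drop = ct_drop.fit_transform(toy)

# remainder="passthrough": unlisted columns are kept unchanged
ct_keep = ColumnTransformer(
    [("scale", StandardScaler(), ["a"])], remainder="passthrough"
)
out_keep = ct_keep.fit_transform(toy)

print(out_drop.shape, out_keep.shape)
```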

In [126]:
models_search = {
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
    "GradientBoostingClassifier": GradientBoostingClassifier(random_state=0),
    "ExtraTreesClassifier": ExtraTreesClassifier(random_state=0),
    "AdaBoostClassifier": AdaBoostClassifier(random_state=0),
    "HGrBoostingClassifier": HistGradientBoostingClassifier(random_state=0),
    "Xgb": XGBClassifier(random_state=0), 
    "OrdinalLogistic": LogisticAT(),
    "Mlp": MLPClassifier(random_state=0)
}

In [127]:
params_search = {
    "DecisionTreeClassifier": {},
    "RandomForestClassifier": {},
    "GradientBoostingClassifier": {},
    "ExtraTreesClassifier": {},
    "AdaBoostClassifier": {},
    "HGrBoostingClassifier": {},
    "Xgb": {},
    "OrdinalLogistic": {},
    "Mlp": {}
}

In [128]:
class HyperparameterOptimizationSearch:

    def __init__(self, models, params):
        self.models = models
        self.params = params
        self.keys = list(models.keys())
        self.grid_searches = {}

    def fit(self, X, y, cv, n_jobs, verbose=1, scoring=None, refit=False):
        for key in self.keys:
            print(f"\nRunning GridSearchCV for {key} \n")
            model_pipeline = PipelineOptimization(self.models[key], X)
            params = self.params[key]
            gs = GridSearchCV(model_pipeline, params, cv=cv, n_jobs=n_jobs, verbose=verbose, scoring=scoring, refit=refit)
            gs.fit(X, y)
            self.grid_searches[key] = gs

    def score_summary(self, sort_by='mean_score'):
        def row(key, scores, params):
            d = {
                'estimator': key,
                'min_score': min(scores),
                'max_score': max(scores),
                'mean_score': np.mean(scores),
                'std_score': np.std(scores),
            }
            return pd.Series({**params, **d})

        rows = []
        for k in self.grid_searches:
            gs = self.grid_searches[k]
            params = gs.cv_results_['params']
            all_scores = gs.cv_results_['mean_test_score']
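            # Only the mean test score is passed per candidate, so min_score,
            # mean_score and max_score coincide and std_score is 0
            for p, s in zip(params, all_scores):
                rows.append(row(k, [s], p))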

        df = pd.concat(rows, axis=1).T.sort_values([sort_by], ascending=False)
        columns = ['estimator', 'min_score', 'mean_score', 'max_score', 'std_score']
        columns += [c for c in df.columns if c not in columns]
        return df[columns], self.grid_searches

In [129]:
search = HyperparameterOptimizationSearch(models=models_search, params=params_search)
search.fit(
    X_train, y_train,
    scoring='accuracy',
    n_jobs=-1,
    cv=2
)


Running GridSearchCV for DecisionTreeClassifier 

Fitting 2 folds for each of 1 candidates, totalling 2 fits

Running GridSearchCV for RandomForestClassifier 

Fitting 2 folds for each of 1 candidates, totalling 2 fits

Running GridSearchCV for GradientBoostingClassifier 

Fitting 2 folds for each of 1 candidates, totalling 2 fits

Running GridSearchCV for ExtraTreesClassifier 

Fitting 2 folds for each of 1 candidates, totalling 2 fits

Running GridSearchCV for AdaBoostClassifier 

Fitting 2 folds for each of 1 candidates, totalling 2 fits

Running GridSearchCV for HGrBoostingClassifier 

Fitting 2 folds for each of 1 candidates, totalling 2 fits

Running GridSearchCV for Xgb 

Fitting 2 folds for each of 1 candidates, totalling 2 fits

Running GridSearchCV for OrdinalLogistic 

Fitting 2 folds for each of 1 candidates, totalling 2 fits

Running GridSearchCV for Mlp 

Fitting 2 folds for each of 1 candidates, totalling 2 fits


In [130]:
grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary

| | estimator | min_score | mean_score | max_score | std_score |
|---|---|---|---|---|---|
| 1 | RandomForestClassifier | 0.592175 | 0.592175 | 0.592175 | 0.0 |
| 2 | GradientBoostingClassifier | 0.586657 | 0.586657 | 0.586657 | 0.0 |
| 6 | Xgb | 0.577251 | 0.577251 | 0.577251 | 0.0 |
| 5 | HGrBoostingClassifier | 0.575245 | 0.575245 | 0.575245 | 0.0 |
| 3 | ExtraTreesClassifier | 0.546401 | 0.546401 | 0.546401 | 0.0 |
| 4 | AdaBoostClassifier | 0.535616 | 0.535616 | 0.535616 | 0.0 |
| 0 | DecisionTreeClassifier | 0.502634 | 0.502634 | 0.502634 | 0.0 |
| 8 | Mlp | 0.492852 | 0.492852 | 0.492852 | 0.0 |
| 7 | OrdinalLogistic | 0.48144 | 0.48144 | 0.48144 | 0.0 |


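Once again, transforming the numerical variables has not improved performance: the best accuracy here is 59.2 % (Random Forest Classifier), below the 60.1 % obtained with the untransformed reduced dataset. I will therefore not be transforming the numerical variables.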

---

## Conclusion and Next Steps

* The 2 best model performances were obtained using a reduced dataset of 23 variables, with Gradient Boosting Classifier and Random Forest Classifier
* Overall, transformation of numerical data did not improve performance
* The next step is to tune hyperparameters for Gradient Boosting Classifier and Random Forest Classifier, with the reduced dataset
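
As a pointer for the tuning step, a non-empty entry in `params_search` would need its keys prefixed with `model__`, because the estimator sits in the pipeline step named `"model"`. The sketch below shows the shape such a grid could take. The hyperparameter values are placeholders chosen for illustration, not tuned recommendations, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Same step name ("model") as in PipelineOptimization; preprocessing omitted for brevity
pipeline_demo = Pipeline([("model", RandomForestClassifier(random_state=0))])

# Placeholder values, for illustration only
params_demo = {
    "model__n_estimators": [20, 50],
    "model__max_depth": [3, None],
}

gs_demo = GridSearchCV(pipeline_demo, params_demo, cv=2, scoring="accuracy")
gs_demo.fit(X_demo, y_demo)

print(gs_demo.best_params_)
```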