In this notebook we create several models and fine-tune them to get the best possible results.

I will be using the following models:
1. BaggingClassifier
2. RandomForestClassifier
3. ExtraTreesClassifier
4. VotingClassifier (with TBD models)
5. GaussianNB
6. KNeighborsClassifier
7. MLPClassifier
8. LinearTreeClassifier
9. LinearForestClassifier
10. LinearBoostClassifier

The choice of models is not based on any specific reason; the aim is to try out different models and see how they perform. There is one condition, though: each model must be able to output class probabilities, because these are used to give a confidence score.
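To make the probability requirement concrete, here is a minimal sketch of how a confidence score could be derived from `predict_proba`. The synthetic data and the variable names `proba`, `predicted` and `confidence` are assumptions for illustration only; they are not taken from this notebook's pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption); the real features are loaded further down.
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_classes=3, random_state=0
)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# The condition from the text: the model must expose class probabilities.
assert hasattr(clf, "predict_proba")

# One row per sample, one column per class; each row sums to 1.
proba = clf.predict_proba(X[:5])

# The predicted class is the most probable one, and its probability
# can serve as the confidence score for that prediction.
predicted = clf.classes_[np.argmax(proba, axis=1)]
confidence = proba.max(axis=1)

for label, score in zip(predicted, confidence):
    print(f"class {label} with confidence {score:.2f}")
```

Taking the maximum probability is only one possible definition of confidence; the margin between the two most probable classes would be another.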

In [10]:
from sklearn.ensemble import BaggingClassifier  # https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html#sklearn.ensemble.BaggingClassifier
from sklearn.ensemble import RandomForestClassifier  # https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier  # https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html#sklearn.ensemble.ExtraTreesClassifier
from sklearn.ensemble import VotingClassifier  # https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html#sklearn.ensemble.VotingClassifier
from sklearn.naive_bayes import GaussianNB  # https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB
from sklearn.neighbors import KNeighborsClassifier  # https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier
from sklearn.neural_network import MLPClassifier  # https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html
from lineartree import LinearTreeClassifier, LinearForestClassifier, LinearBoostClassifier  # https://github.com/cerlymarco/linear-tree

In [38]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import pandas as pd

%run data-cleaning.ipynb
%run model.py

In [27]:
df = pd.read_csv(r"dataset-prorail-clean-3.csv")
df = clean_data(df)

df['duur_prog_fh_seconds'] = df['duur_prog_fh'].dt.total_seconds()
num_bins = 10
df, bin_edges = create_bins(df, 'duur_prog_fh_seconds', num_bins)
label_encoder = LabelEncoder()
df['duur_prog_fh_seconds_bins_enc'] = label_encoder.fit_transform(df['duur_prog_fh_seconds_bins'])
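The `create_bins` helper comes from the cleaning notebook and is not shown here. As a rough illustration of the idea only, the sketch below turns a duration into seconds and splits it into equal-width bins with `pd.cut`. The function `create_bins_sketch`, the toy data and the choice of equal-width bins are all assumptions; the real helper may bin differently. The sketch also swaps in the category codes as the integer encoding instead of `LabelEncoder`, because the codes are guaranteed to follow the bin order.

```python
import pandas as pd


def create_bins_sketch(df, column, num_bins):
    """Hypothetical stand-in for create_bins: equal-width bins via pd.cut."""
    binned, bin_edges = pd.cut(df[column], bins=num_bins, retbins=True)
    out = df.copy()
    out[f"{column}_bins"] = binned
    return out, bin_edges


# Toy durations (assumption), mirroring the timedelta column used above.
toy = pd.DataFrame(
    {"duur_prog_fh": pd.to_timedelta(["5min", "30min", "2h", "1h", "10min", "4h"])}
)
toy["duur_prog_fh_seconds"] = toy["duur_prog_fh"].dt.total_seconds()

toy, edges = create_bins_sketch(toy, "duur_prog_fh_seconds", 3)

# Integer class per row: 0 is the shortest-duration bin, 2 the longest.
toy["duur_prog_fh_seconds_bins_enc"] = toy["duur_prog_fh_seconds_bins"].cat.codes

print(toy[["duur_prog_fh_seconds", "duur_prog_fh_seconds_bins_enc"]])
```

With equal-width bins a few very long durations can leave most rows in the first bin, so quantile-based bins (`pd.qcut`) may be worth comparing if the classes turn out to be heavily imbalanced.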

In [39]:
features = ['stm_geo_mld', 'stm_prioriteit', 'stm_oorz_code', 'stm_contractgeb_gst', 'stm_km_van_mld', 'stm_km_tot_mld', 'stm_techn_mld']
target = 'duur_prog_fh_seconds_bins_enc'

x_train, x_test, y_train, y_test = train_test_split(df[features], df[target], test_size=0.2, random_state=42)
train_models(models, x_train, x_test, y_train, y_test)

Training models...
BaggingClassifier - 0.3943472409152086 - 11.93635055080362
RandomForestClassifier - 0.4003520033129724 - 4.738378568965997
ExtraTreesClassifier - 0.39828139558960557 - 8.721678802579728
GaussianNB - 0.1045656900300238 - 2.4634349969850162
NearestNeighborsClassifier - 0.2072678331090175 - 14.98547783843361
MLPClassifier - 0.09990682265244849 - 20.779171113193048
LinearTreeClassifier - 0.1273423749870587 - 31.453768963336618
LinearForestClassifier - 0.4053214618490527 - 2.6103143891629377
LinearBoostClassifier - 0.1291023915519205 - 31.390331536313635
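`train_models` and `models` are defined in `model.py`, which is not shown in this notebook. The sketch below is one way such a loop could look, assuming `models` is a dictionary of name to estimator and assuming the two printed numbers are test accuracy and elapsed time in seconds (the output above does not label them). The name `train_models_sketch` and the synthetic data are hypothetical.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB


def train_models_sketch(models, x_train, x_test, y_train, y_test):
    """Hypothetical stand-in for train_models: fit, score and time each model."""
    results = {}
    print("Training models...")
    for name, model in models.items():
        start = time.perf_counter()
        model.fit(x_train, y_train)
        # score() returns mean accuracy on the held-out data for classifiers.
        accuracy = model.score(x_test, y_test)
        elapsed = time.perf_counter() - start
        results[name] = (accuracy, elapsed)
        print(f"{name} - {accuracy} - {elapsed}")
    return results


# Synthetic stand-in data (assumption), split the same way as in the cell above.
X, y = make_classification(
    n_samples=300, n_features=7, n_informative=5, n_classes=3, random_state=42
)
xs_train, xs_test, ys_train, ys_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

demo_models = {
    "RandomForestClassifier": RandomForestClassifier(random_state=42),
    "GaussianNB": GaussianNB(),
}
demo_results = train_models_sketch(demo_models, xs_train, xs_test, ys_train, ys_test)
```

Returning the results as a dictionary, rather than only printing them, makes it easier to sort the models or compare runs later.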


Looking only at the base models, without hyperparameter tuning, we can see that some models definitely perform better than others. Because of this, we have decided to continue with the following four to see if we can improve their performance:
1. BaggingClassifier
2. RandomForestClassifier
3. ExtraTreesClassifier
4. LinearForestClassifier
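As a possible starting point for tuning the selected models, the sketch below runs a small grid search over a `RandomForestClassifier` with `GridSearchCV`. The parameter grid, the 3-fold cross-validation and the synthetic data are illustrative assumptions, not settings used or validated in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption); in the notebook this would be x_train, y_train.
X, y = make_classification(
    n_samples=300, n_features=7, n_informative=5, n_classes=3, random_state=42
)

# Small illustrative grid; real ranges would depend on the data and time budget.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [None, 5],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="accuracy",
    cv=3,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)

# The refitted best model still exposes predict_proba for the confidence score.
best_model = search.best_estimator_
```

The same pattern applies to the other selected models by swapping the estimator and its grid; for larger grids, `RandomizedSearchCV` usually needs far fewer fits.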