# Model Testing Pipeline

Identify the best model by comparing all candidates, each configured with its own best hyperparameters:
1.   Re-train models using both training and validation data
2.   Evaluate models against the testing data

In [None]:
!pip install ipython-autotime
%load_ext autotime

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting ipython-autotime
  Downloading ipython_autotime-0.3.1-py2.py3-none-any.whl (6.8 kB)
Collecting jedi>=0.10
  Downloading jedi-0.18.2-py2.py3-none-any.whl (1.6 MB)
Installing collected packages: jedi, ipython-autotime
Successfully installed ipython-autotime-0.3.1 jedi-0.18.2
time: 572 µs (started: 2023-01-16 09:37:09 +00:00)


In [None]:
# Basic Libraries

import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
from datetime import datetime

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive
time: 19.8 s (started: 2023-01-16 09:37:09 +00:00)


In [None]:
# Data Source

df = pd.read_csv("/content/drive/MyDrive/cleaned_gee_data.csv")
df = df.drop(columns=['Unnamed: 0', 'BRIGHTNESS'])  # BRIGHTNESS is deprecated; axis is implied by columns=
df.head()

Unnamed: 0,LATITUDE,LONGITUDE,ACQ_DATE,ACQ_TIME,OPEN_TIME,CLOSE_TIME,FIRE_OCCURRED,CO_MOL/M2,SO2_MOL/M2,NO2_MOL/M2,O3_MOL/M2,LOCATION,INSTRUMENT,DRY_SEASON
0,-5.466232,-0.176027,-1.866392,0.634294,0.506405,0.526945,0,-0.024223,-0.47444,-1.152277,-0.511001,-1.159086,0,1
1,-5.466232,-0.176027,-1.866392,0.634294,0.506405,0.526945,0,0.113599,-0.47444,-1.152277,-0.511001,-1.159086,0,1
2,-5.466232,-0.176027,-1.866392,0.634294,0.506405,0.526945,0,-0.024223,-0.47444,-1.361255,-0.511001,-1.159086,0,1
3,-5.466232,-0.176027,-1.866392,0.634294,0.506405,0.526945,0,0.113599,-0.47444,-1.361255,-0.511001,-1.159086,0,1
4,-5.433352,-0.197441,-1.723773,0.634294,2.28608,1.793843,0,-0.967684,0.339667,-1.25177,0.426114,-1.159086,0,1


time: 1.83 s (started: 2023-01-16 09:37:29 +00:00)


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 171893 entries, 0 to 171892
Data columns (total 14 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   LATITUDE       171893 non-null  float64
 1   LONGITUDE      171893 non-null  float64
 2   ACQ_DATE       171893 non-null  float64
 3   ACQ_TIME       171893 non-null  float64
 4   OPEN_TIME      171893 non-null  float64
 5   CLOSE_TIME     171893 non-null  float64
 6   FIRE_OCCURRED  171893 non-null  int64  
 7   CO_MOL/M2      171893 non-null  float64
 8   SO2_MOL/M2     171893 non-null  float64
 9   NO2_MOL/M2     171893 non-null  float64
 10  O3_MOL/M2      171893 non-null  float64
 11  LOCATION       171893 non-null  float64
 12  INSTRUMENT     171893 non-null  int64  
 13  DRY_SEASON     171893 non-null  int64  
dtypes: float64(11), int64(3)
memory usage: 18.4 MB
time: 28.4 ms (started: 2023-01-16 09:37:31 +00:00)


In [None]:
display(df['FIRE_OCCURRED'].value_counts())

0    170544
1      1349
Name: FIRE_OCCURRED, dtype: int64

time: 7.65 ms (started: 2023-01-16 09:37:31 +00:00)


In [None]:
X = df.drop('FIRE_OCCURRED', axis=1)
y = df['FIRE_OCCURRED']

time: 9.27 ms (started: 2023-01-16 09:37:31 +00:00)


In [None]:
# Training, Testing Split

from sklearn.model_selection import train_test_split

# 90:10

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=10, shuffle=True)

Original = [X_train, X_test, y_train, y_test] # For reference

time: 304 ms (started: 2023-01-16 09:37:31 +00:00)


In [None]:
if len(X_train)==len(y_train) and len(X_test) == len(y_test):
  print("X and y data length matching")
else:
  print("Error in data preparation pipeline")
print()
print("No. of training data = %d" % len(X_train))
print("No. of testing data = %d" % len(X_test))

X and y data length matching

No. of training data = 154703
No. of testing data = 17190
time: 1.98 ms (started: 2023-01-16 09:37:31 +00:00)


In [None]:
display(y_test.value_counts())

0    17059
1      131
Name: FIRE_OCCURRED, dtype: int64

time: 6.88 ms (started: 2023-01-16 09:37:31 +00:00)
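With so few fire cases in the test split, accuracy on its own says little. As a rough reference point, the sketch below computes the accuracy of a trivial classifier that always predicts 0 (no fire), using the class counts printed above. The variable names are illustrative only.

```python
# Class counts of y_test as printed above
n_no_fire = 17059
n_fire = 131

# A classifier that always predicts 0 gets every no-fire case right
# and every fire case wrong.
baseline_accuracy = n_no_fire / (n_no_fire + n_fire)
baseline_recall = 0 / n_fire

print(f"baseline accuracy: {baseline_accuracy:.6f}")
print(f"baseline recall:   {baseline_recall:.6f}")
```

Any model has to be judged against this baseline, which is why recall and F1 are tracked alongside accuracy.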


In [None]:
# SMOTE

from collections import Counter
from imblearn.over_sampling import SMOTE 

print('Original dataset shape %s' % Counter(y_train))
sm = SMOTE(random_state=10)
X_train, y_train = sm.fit_resample(X_train, y_train)
print('Resampled dataset shape %s' % Counter(y_train))

Original dataset shape Counter({0: 153485, 1: 1218})
Resampled dataset shape Counter({0: 153485, 1: 153485})
time: 1.01 s (started: 2023-01-16 09:37:31 +00:00)
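For intuition, the core idea behind SMOTE can be sketched in a few lines: each synthetic sample is placed at a random position on the segment between a minority sample and one of its k nearest minority neighbours. This is a simplified illustration in plain Python, not the imbalanced-learn implementation; the function name `smote_sketch` and the toy data are hypothetical.

```python
import math
import random

def smote_sketch(minority, n_new, k=5, seed=10):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest *other* minority points (Euclidean distance)
        nearest = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        neighbour = minority[rng.choice(nearest)]
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic

# Toy minority class: the four corners of the unit square
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_sketch(minority, n_new=6, k=2)
print(new_points)
```

Because the synthetic points are interpolated from training samples only, resampling is applied after the train/test split, as in the cell above, so the test set keeps its original class distribution.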


In [None]:
# Shuffle Data since SMOTE appended many 1s at the end
# Required for some algorithms such as ANN

from sklearn.utils import shuffle

X_train, y_train = shuffle(X_train, y_train, random_state = 10)

time: 123 ms (started: 2023-01-16 09:37:32 +00:00)


In [None]:
# Evaluation Metrics

from sklearn.metrics import confusion_matrix, recall_score, f1_score, roc_auc_score, accuracy_score

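def evaluation_metrics(y_true, y_pred):
  """Return [confusion matrix as (tn, fp, fn, tp), accuracy, recall, F1, ROC AUC]."""
  cfm = confusion_matrix(y_true, y_pred).ravel()
  acc = accuracy_score(y_true, y_pred)
  recs = recall_score(y_true, y_pred, average='binary')
  f1s = f1_score(y_true, y_pred, average='binary')
  # y_pred holds hard 0/1 labels, so this AUC is computed at a single threshold
  rocs = roc_auc_score(y_true, y_pred, average='macro')
  return [cfm, acc, recs, f1s, rocs]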

time: 1.49 ms (started: 2023-01-16 09:37:32 +00:00)


Confusion matrix format : [ tn , fp , fn , tp ]
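As a sanity check, accuracy, recall, precision and F1 can all be recomputed by hand from the four confusion-matrix entries. The sketch below does this in plain Python for the log_clf row reported in the results table further down; the helper name `metrics_from_confusion` is hypothetical.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Derive the headline metrics from a binary confusion matrix."""
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    recall = tp / (tp + fn)          # share of real fires that were caught
    precision = tp / (tp + fp)       # share of fire alarms that were real
    f1 = 2 * tp / (2 * tp + fp + fn)
    return {
        'accuracy': accuracy,
        'recall': recall,
        'precision': precision,
        'f1_score': f1,
    }

# [tn, fp, fn, tp] for log_clf, taken from the results table
log_clf_metrics = metrics_from_confusion(13920, 3139, 39, 92)
for metric, value in log_clf_metrics.items():
    print(f"{metric}: {value:.6f}")
```

The low F1 of log_clf comes from its precision: with 3139 false positives against 92 true positives, fewer than 3% of its fire predictions are correct.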

In [None]:
# Store Model Parameters and Eval

models_final = pd.DataFrame(columns = ['model_name', 'model', 'parameters'])
models_test = pd.DataFrame(columns = ['model_name', 'confusion_matrix', 'accuracy', 'recall', 'f1_score', 'roc_auc_score'])

time: 4.71 ms (started: 2023-01-16 09:37:32 +00:00)
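Note that `DataFrame.append`, which the cells below use to record results, was deprecated in pandas 1.4 and removed in pandas 2.0. One possible alternative, sketched here with plain Python containers, is to collect one dict per model in a list and build the table once at the end. The helper `record_result` and the list `test_rows` are hypothetical names, and the two example rows reuse values from the results table further down.

```python
def record_result(rows, name, evaluation_results):
    """Append one evaluation row; evaluation_results follows the
    [cfm, acc, recall, f1, roc] order returned by evaluation_metrics."""
    cfm, acc, rec, f1, roc = evaluation_results
    rows.append({
        'model_name': name,
        'confusion_matrix': cfm,
        'accuracy': acc,
        'recall': rec,
        'f1_score': f1,
        'roc_auc_score': roc,
    })
    return rows

test_rows = []
record_result(test_rows, 'log_clf',
              [[13920, 3139, 39, 92], 0.815125, 0.70229, 0.054729, 0.759141])
record_result(test_rows, 'neigh_clf',
              [[16553, 506, 12, 119], 0.969866, 0.908397, 0.314815, 0.939368])

# pd.DataFrame(test_rows) would turn the list into a table in one step.
best_by_recall = max(test_rows, key=lambda row: row['recall'])['model_name']
print(best_by_recall)
```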


In [None]:
# Import ML Algorithms

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
import xgboost
from xgboost import XGBClassifier
import lightgbm
from lightgbm import LGBMClassifier
import tensorflow as tf
from tensorflow import keras
from sklearn.ensemble import VotingClassifier

# Save Model

import pickle

time: 4.28 s (started: 2023-01-16 09:37:32 +00:00)


## Logistic Regression

- Library: Scikit-learn

**Best Parameters:**

{'warm_start': True,
 'solver': 'lbfgs',
 'penalty': 'none',
 'max_iter': 331,
 'dual': False,
 'C': 0}

In [None]:
name = 'log_clf'

log_clf = LogisticRegression(penalty = 'none', 
                             warm_start = True,
                             solver = 'lbfgs',
                             max_iter = 331,
                             dual = False,
                             C = 0,
                             n_jobs = -1, 
                             random_state = 10
                             ).fit(X_train,y_train)

y_true = y_test
y_pred = log_clf.predict(X_test)
evaluation_results = evaluation_metrics(y_true, y_pred)

models_final = models_final.append({'model_name': name, 
                        'model': log_clf, 
                        'parameters': log_clf.get_params()}, 
                        ignore_index=True)

models_test = models_test.append({'model_name': name, 
                                  'confusion_matrix' : evaluation_results[0], 
                                  'accuracy': evaluation_results[1], 
                                  'recall' : evaluation_results[2], 
                                  'f1_score': evaluation_results[3],
                                  'roc_auc_score': evaluation_results[4]}, 
                                 ignore_index=True)



time: 5.55 s (started: 2023-01-16 09:37:37 +00:00)


In [None]:
# Save Model
pickle.dump(log_clf, open('log_clf.sav', 'wb')) 

# Load Model
# log_clf = pickle.load(open('log_clf.sav', 'rb'))

time: 5.34 ms (started: 2023-01-16 09:37:42 +00:00)
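The save cells in this notebook pass `open(...)` directly to `pickle.dump`, which leaves closing the file to the garbage collector. A slightly more defensive pattern, sketched below, wraps the file in a `with` block. The helpers `save_model` and `load_model` are hypothetical, and a plain dict stands in for a fitted model so the example runs without scikit-learn.

```python
import pickle
import tempfile
from pathlib import Path

def save_model(model, path):
    """Pickle a model and close the file even if dumping fails."""
    with open(path, 'wb') as fh:
        pickle.dump(model, fh)

def load_model(path):
    """Load a pickled model back from disk."""
    with open(path, 'rb') as fh:
        return pickle.load(fh)

# Stand-in object; in the notebook this would be log_clf, svc_clf, etc.
stand_in = {'name': 'log_clf', 'max_iter': 331}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'log_clf.sav'
    save_model(stand_in, path)
    restored = load_model(path)

print(restored == stand_in)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.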


## Support Vector Machine (SVM)

- Library: Scikit-learn

**Best Parameters:**

{'kernel': 'rbf',
 'C': 8,
 'class_weight': 'balanced'}

In [None]:
X_train_SVM = Original[0]
X_test_SVM = Original[1]
y_train_SVM = Original[2]
y_test_SVM = Original[3]

time: 1.23 ms (started: 2023-01-16 09:37:42 +00:00)


In [None]:
# Shuffle

X_train_SVM, y_train_SVM = shuffle(X_train_SVM, y_train_SVM, random_state = 10)

time: 99.5 ms (started: 2023-01-16 09:37:42 +00:00)


In [None]:
display(y_train_SVM.value_counts())

0    153485
1      1218
Name: FIRE_OCCURRED, dtype: int64

time: 44.7 ms (started: 2023-01-16 09:37:42 +00:00)


In [None]:
name = 'svc_clf'

svc_clf = SVC(kernel = 'rbf', 
              C = 8,
              class_weight='balanced',
              random_state = 10
              ).fit(X_train_SVM,y_train_SVM)

y_true = y_test_SVM
y_pred = svc_clf.predict(X_test_SVM)
evaluation_results = evaluation_metrics(y_true, y_pred)

models_final = models_final.append({'model_name': name, 
                        'model': svc_clf, 
                        'parameters': svc_clf.get_params()}, 
                       ignore_index=True)

models_test = models_test.append({'model_name': name, 
                                  'confusion_matrix' : evaluation_results[0], 
                                  'accuracy': evaluation_results[1], 
                                  'recall' : evaluation_results[2], 
                                  'f1_score': evaluation_results[3],
                                  'roc_auc_score': evaluation_results[4]}, 
                                 ignore_index=True)

time: 5min 18s (started: 2023-01-16 09:37:42 +00:00)


In [None]:
# Save Model
pickle.dump(svc_clf, open('svc_clf.sav', 'wb')) 

# Load Model
# svc_clf = pickle.load(open('svc_clf.sav', 'rb'))

time: 4.7 ms (started: 2023-01-16 09:43:01 +00:00)


## Naive Bayes

- Library: Scikit-learn

**Best Parameters:**

{'var_smoothing': 1e-3}

In [None]:
name = 'bayes_clf'

bayes_clf = GaussianNB(var_smoothing = 1e-3
                       ).fit(X_train,y_train)

y_true = y_test
y_pred = bayes_clf.predict(X_test)
evaluation_results = evaluation_metrics(y_true, y_pred)

models_final = models_final.append({'model_name': name, 
                        'model': bayes_clf, 
                        'parameters': bayes_clf.get_params()}, 
                       ignore_index=True)

models_test = models_test.append({'model_name': name, 
                                  'confusion_matrix' : evaluation_results[0], 
                                  'accuracy': evaluation_results[1], 
                                  'recall' : evaluation_results[2], 
                                  'f1_score': evaluation_results[3],
                                  'roc_auc_score': evaluation_results[4]}, 
                                 ignore_index=True)

time: 143 ms (started: 2023-01-16 09:43:01 +00:00)


In [None]:
# Save Model
pickle.dump(bayes_clf, open('bayes_clf.sav', 'wb')) 

# Load Model
# bayes_clf = pickle.load(open('bayes_clf.sav', 'rb'))

time: 3.83 ms (started: 2023-01-16 09:43:01 +00:00)


## K-Nearest Neighbor

- Library: Scikit-learn

**Best Parameters:**

{'n_neighbors': 19, 'algorithm': 'kd_tree'}

In [None]:
name = 'neigh_clf'

neigh_clf = KNeighborsClassifier(n_neighbors = 19, 
                                 algorithm = 'kd_tree',
                                 n_jobs = -1, 
                                 ).fit(X_train,y_train)

y_true = y_test
y_pred = neigh_clf.predict(X_test)
evaluation_results = evaluation_metrics(y_true, y_pred)

models_final = models_final.append({'model_name': name, 
                        'model': neigh_clf, 
                        'parameters': neigh_clf.get_params()}, 
                       ignore_index=True)

models_test = models_test.append({'model_name': name, 
                                  'confusion_matrix' : evaluation_results[0], 
                                  'accuracy': evaluation_results[1], 
                                  'recall' : evaluation_results[2], 
                                  'f1_score': evaluation_results[3],
                                  'roc_auc_score': evaluation_results[4]}, 
                                 ignore_index=True)

time: 12.4 s (started: 2023-01-16 09:43:01 +00:00)


In [None]:
# Save Model
pickle.dump(neigh_clf, open('neigh_clf.sav', 'wb')) 

# Load Model
# neigh_clf = pickle.load(open('neigh_clf.sav', 'rb'))

time: 126 ms (started: 2023-01-16 09:43:14 +00:00)


## Decision Tree

- Library: Scikit-learn

**Best Parameters:**

{'splitter': 'random',
 'min_samples_leaf': 2,
 'max_features': 11,
 'max_depth': None,
 'criterion': 'entropy'}

In [None]:
name = 'tree_clf'

tree_clf = DecisionTreeClassifier(criterion = 'entropy', 
                                  splitter = 'random', 
                                  min_samples_leaf = 2,
                                  max_features = 11,
                                  max_depth = None,
                                  random_state = 10
                                  ).fit(X_train,y_train)

y_true = y_test
y_pred = tree_clf.predict(X_test)
evaluation_results = evaluation_metrics(y_true, y_pred)

models_final = models_final.append({'model_name': name, 
                        'model': tree_clf, 
                        'parameters': tree_clf.get_params()}, 
                       ignore_index=True)

models_test = models_test.append({'model_name': name, 
                                  'confusion_matrix' : evaluation_results[0], 
                                  'accuracy': evaluation_results[1], 
                                  'recall' : evaluation_results[2], 
                                  'f1_score': evaluation_results[3],
                                  'roc_auc_score': evaluation_results[4]}, 
                                 ignore_index=True)

time: 661 ms (started: 2023-01-16 09:43:14 +00:00)


In [None]:
# Save Model
pickle.dump(tree_clf, open('tree_clf.sav', 'wb')) 

# Load Model
# tree_clf = pickle.load(open('tree_clf.sav', 'rb'))

time: 3.29 ms (started: 2023-01-16 09:43:15 +00:00)


## Random Forest Classifier

- Library: Scikit-learn

**Best Parameters:**

{'n_estimators': 415,
 'min_samples_split': 6,
 'min_samples_leaf': 1,
 'max_features': 4,
 'max_depth': 18}

In [None]:
name = 'rnd_clf'

rnd_clf = RandomForestClassifier(n_estimators = 415, 
                                  min_samples_split = 6,
                                  min_samples_leaf = 1,
                                  max_features = 4,
                                  max_depth = 18, 
                                  n_jobs = -1, 
                                  random_state = 10
                                  ).fit(X_train,y_train)

y_true = y_test
y_pred = rnd_clf.predict(X_test)
evaluation_results = evaluation_metrics(y_true, y_pred)

models_final = models_final.append({'model_name': name, 
                        'model': rnd_clf, 
                        'parameters': rnd_clf.get_params()}, 
                       ignore_index=True)

models_test = models_test.append({'model_name': name, 
                                  'confusion_matrix' : evaluation_results[0], 
                                  'accuracy': evaluation_results[1], 
                                  'recall' : evaluation_results[2], 
                                  'f1_score': evaluation_results[3],
                                  'roc_auc_score': evaluation_results[4]}, 
                                 ignore_index=True)

time: 4min 52s (started: 2023-01-16 09:43:15 +00:00)


In [None]:
# Save Model
# pickle.dump(rnd_clf, open('rnd_clf.sav', 'wb')) 
pickle.dump(rnd_clf, open('rnd_clf.h', 'wb')) 

# Load Model
# rnd_clf = pickle.load(open('rnd_clf.h', 'rb'))

time: 162 ms (started: 2023-01-16 09:48:07 +00:00)


## Gradient Boosting Classifier

- Library: Scikit-learn

**Best Parameters:**

{'n_estimators': 1000, 'max_depth': 8, 'learning_rate': 0.1}

In [None]:
name = 'gboost_clf'

gboost_clf = GradientBoostingClassifier(n_estimators = 1000, 
                                        learning_rate = 0.1, 
                                        max_depth = 8,
                                        random_state = 10
                                        ).fit(X_train,y_train)

y_true = y_test
y_pred = gboost_clf.predict(X_test)
evaluation_results = evaluation_metrics(y_true, y_pred)

models_final = models_final.append({'model_name': name, 
                        'model': gboost_clf, 
                        'parameters': gboost_clf.get_params()}, 
                       ignore_index=True)

models_test = models_test.append({'model_name': name, 
                                  'confusion_matrix' : evaluation_results[0], 
                                  'accuracy': evaluation_results[1], 
                                  'recall' : evaluation_results[2], 
                                  'f1_score': evaluation_results[3],
                                  'roc_auc_score': evaluation_results[4]}, 
                                 ignore_index=True)

time: 42min 54s (started: 2023-01-16 09:48:07 +00:00)


In [None]:
# Save Model
pickle.dump(gboost_clf, open('gboost_clf.sav', 'wb')) 

# Load Model
# gboost_clf = pickle.load(open('gboost_clf.sav', 'rb'))

time: 64.3 ms (started: 2023-01-16 10:31:01 +00:00)


## XGBoost

- Library: xgboost

**Best Parameters:**

{'n_estimators': 1000,
 'min_child_weight': 7,
 'max_depth': 8,
 'learning_rate': 0.1}

In [None]:
name = 'xgboost_clf'

xgboost_clf = XGBClassifier(n_estimators = 1000, 
                            learning_rate = 0.1,
                            max_depth = 8, 
                            min_child_weight = 7,
                            random_state = 10
                            ).fit(X_train,y_train)

y_true = y_test
y_pred = xgboost_clf.predict(X_test)
evaluation_results = evaluation_metrics(y_true, y_pred)

models_final = models_final.append({'model_name': name, 
                        'model': xgboost_clf, 
                        'parameters': xgboost_clf.get_params()}, 
                       ignore_index=True)

models_test = models_test.append({'model_name': name, 
                                  'confusion_matrix' : evaluation_results[0], 
                                  'accuracy': evaluation_results[1], 
                                  'recall' : evaluation_results[2], 
                                  'f1_score': evaluation_results[3],
                                  'roc_auc_score': evaluation_results[4]}, 
                                 ignore_index=True)

time: 10min 44s (started: 2023-01-16 10:31:02 +00:00)


In [None]:
# Save Model
pickle.dump(xgboost_clf, open('xgboost_clf.sav', 'wb')) 

# Load Model
# xgboost_clf = pickle.load(open('xgboost_clf.sav', 'rb'))

time: 11.1 ms (started: 2023-01-16 10:41:46 +00:00)


## LightGBM

- Library: lightgbm

**Best Parameters:**

{'num_leaves': 50,
 'n_estimators': 1000,
 'min_data_in_leaf': 10,
 'max_depth': 8,
 'learning_rate': 0.05}

In [None]:
name = 'lightgbm_clf'

lightgbm_clf = LGBMClassifier(n_estimators = 1000, 
                              learning_rate = 0.05, 
                              max_depth = 8,
                              num_leaves = 50,
                              min_data_in_leaf = 10,
                              random_state = 10
                              ).fit(X_train,y_train)

y_true = y_test
y_pred = lightgbm_clf.predict(X_test)
evaluation_results = evaluation_metrics(y_true, y_pred)

models_final = models_final.append({'model_name': name, 
                        'model': lightgbm_clf, 
                        'parameters': lightgbm_clf.get_params()}, 
                       ignore_index=True)

models_test = models_test.append({'model_name': name, 
                                  'confusion_matrix' : evaluation_results[0], 
                                  'accuracy': evaluation_results[1], 
                                  'recall' : evaluation_results[2], 
                                  'f1_score': evaluation_results[3],
                                  'roc_auc_score': evaluation_results[4]}, 
                                 ignore_index=True)

time: 45.8 s (started: 2023-01-16 10:41:46 +00:00)


In [None]:
# Save Model
pickle.dump(lightgbm_clf, open('lightgbm_clf.sav', 'wb')) 

# Load Model
# lightgbm_clf = pickle.load(open('lightgbm_clf.sav', 'rb'))

time: 278 ms (started: 2023-01-16 10:42:32 +00:00)


## Artificial Neural Network

- Library: Keras, Tensorflow

**Best Parameters:**
- Batch size: 15
- Epochs: 50

In [None]:
tf.random.set_seed(10)

time: 1.37 ms (started: 2023-01-16 10:42:32 +00:00)


In [None]:
name = 'ann_clf'

ann_clf = keras.models.Sequential([
    keras.layers.Dense(15, input_shape=(X_train.shape[1],), activation='relu'), # First hidden layer; Dense includes a bias term by default
    # keras.layers.Dense(10, activation='relu'), 
    keras.layers.Dense(10, activation='relu'), 
    keras.layers.Dense(1, activation='sigmoid')
])

ann_clf.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 15)                210       
                                                                 
 dense_1 (Dense)             (None, 10)                160       
                                                                 
 dense_2 (Dense)             (None, 1)                 11        
                                                                 
Total params: 381
Trainable params: 381
Non-trainable params: 0
_________________________________________________________________
time: 239 ms (started: 2023-01-16 10:42:32 +00:00)
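The parameter counts in the summary can be reproduced by hand: a Dense layer with a bias term has `inputs * units + units` parameters. The sketch below assumes 13 input features (the 14 columns listed by `df.info()` minus the FIRE_OCCURRED target); `dense_params` is a hypothetical helper.

```python
def dense_params(n_inputs, n_units, use_bias=True):
    """Number of trainable parameters in a fully connected layer."""
    weights = n_inputs * n_units
    biases = n_units if use_bias else 0
    return weights + biases

# Assumed layer widths: 13 inputs -> 15 -> 10 -> 1
layer_sizes = [13, 15, 10, 1]
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(counts)

print(counts, total)
print(dense_params(13, 15, use_bias=False))
```

The first layer only reaches the 210 parameters shown in the summary when its 15 bias terms are counted, so the network does use biases.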


In [None]:
display(y_train.value_counts())

1    153485
0    153485
Name: FIRE_OCCURRED, dtype: int64

time: 20.8 ms (started: 2023-01-16 10:42:32 +00:00)


In [None]:
display(y_test.value_counts())

0    17059
1      131
Name: FIRE_OCCURRED, dtype: int64

time: 10.1 ms (started: 2023-01-16 10:42:32 +00:00)


In [None]:
ann_clf.compile(optimizer = 'adam', 
                metrics=['accuracy'], 
                loss ='binary_crossentropy')

record = ann_clf.fit(
            X_train, 
            y_train, 
            validation_data = (X_test, y_test), 
            batch_size = 15, 
            epochs = 50)

Epoch 1/50
...
Epoch 50/50
time: 41min 22s (started: 2023-01-16 10:42:32 +00:00)


In [None]:
y_true = y_test
prediction = ann_clf.predict(X_test)
# Threshold the sigmoid outputs at 0.5 to get hard 0/1 class labels
y_pred = (prediction[:, 0] >= 0.5).astype(int)

evaluation_results = evaluation_metrics(y_true, y_pred)

models_final = models_final.append({'model_name': name, 
                        'model': ann_clf, 
                        'parameters': ann_clf.layers}, 
                       ignore_index=True)

models_test = models_test.append({'model_name': name, 
                                  'confusion_matrix' : evaluation_results[0], 
                                  'accuracy': evaluation_results[1], 
                                  'recall' : evaluation_results[2], 
                                  'f1_score': evaluation_results[3],
                                  'roc_auc_score': evaluation_results[4]}, 
                                 ignore_index=True)

time: 1.55 s (started: 2023-01-16 11:23:55 +00:00)


In [None]:
# Save Model
ann_clf.save('ann_clf.h5') 

# Load Model
# ann_clf = tf.keras.models.load_model('ann_clf.h5')

time: 23.1 ms (started: 2023-01-16 11:23:57 +00:00)


## Voting Classifier
- Library: Scikit-learn, xgboost

**Best Parameters:**

{'voting': 'hard'}
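Hard voting means each base model casts one vote per sample and the majority label wins. A minimal sketch in plain Python, for binary 0/1 labels only; the function `hard_vote` and the toy predictions are hypothetical, and ties (only possible with an even number of models) resolve to class 0 in this sketch.

```python
def hard_vote(*model_predictions):
    """Majority vote over several lists of binary 0/1 predictions."""
    voted = []
    for votes in zip(*model_predictions):
        # Predict 1 only when strictly more than half of the models say 1
        voted.append(int(sum(votes) * 2 > len(votes)))
    return voted

# Toy predictions from three models on four samples
pred_m1 = [1, 0, 1, 0]
pred_m2 = [1, 1, 0, 0]
pred_m3 = [0, 1, 1, 0]

ensemble_pred = hard_vote(pred_m1, pred_m2, pred_m3)
print(ensemble_pred)
```

With three voters, a sample is labelled as fire only when at least two models agree, so the ensemble cannot exceed the recall of its members on cases that two of them miss.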

In [None]:
display(models_test)

Unnamed: 0,model_name,confusion_matrix,accuracy,recall,f1_score,roc_auc_score
0,log_clf,"[13920, 3139, 39, 92]",0.815125,0.70229,0.054729,0.759141
1,svc_clf,"[16301, 758, 14, 117]",0.95509,0.89313,0.232604,0.924348
2,bayes_clf,"[11200, 5859, 14, 117]",0.658348,0.89313,0.038317,0.774837
3,neigh_clf,"[16553, 506, 12, 119]",0.969866,0.908397,0.314815,0.939368
4,tree_clf,"[16927, 132, 22, 109]",0.991041,0.832061,0.586022,0.912162
5,rnd_clf,"[16823, 236, 19, 112]",0.985166,0.854962,0.467641,0.920564
6,gboost_clf,"[17008, 51, 21, 110]",0.995812,0.839695,0.753425,0.918353
7,xgboost_clf,"[17001, 58, 22, 109]",0.995346,0.832061,0.731544,0.914331
8,lightgbm_clf,"[16976, 83, 22, 109]",0.993892,0.832061,0.674923,0.913598
9,ann_clf,"[16078, 981, 14, 117]",0.942118,0.89313,0.190399,0.917812


time: 26.5 ms (started: 2023-01-16 11:23:57 +00:00)


In [None]:
print('Best Model By Accuracy')
print(models_test.loc[models_test['accuracy'] == max(models_test['accuracy'])].model_name.to_string(index=False))
print('-----------------------')
print('Best Model By Recall')
print(models_test.loc[models_test['recall'] == max(models_test['recall'])].model_name.to_string(index=False))
print('-----------------------')
print('Best Model By F1')
print(models_test.loc[models_test['f1_score'] == max(models_test['f1_score'])].model_name.to_string(index=False))
print('-----------------------')
print('Best Model By ROC')
print(models_test.loc[models_test['roc_auc_score'] == max(models_test['roc_auc_score'])].model_name.to_string(index=False))
print('-----------------------')

Best Model By Accuracy
gboost_clf
-----------------------
Best Model By Recall
neigh_clf
-----------------------
Best Model By F1
gboost_clf
-----------------------
Best Model By ROC
neigh_clf
-----------------------
time: 14.1 ms (started: 2023-01-16 11:23:57 +00:00)


In [None]:
name = 'ensem_clf'

ensem_clf = VotingClassifier(estimators=[('m1', neigh_clf), ('m2', gboost_clf), ('m3', xgboost_clf)],
                             voting = 'hard',
                             n_jobs = -1, 
                             ).fit(X_train,y_train)

y_true = y_test
y_pred = ensem_clf.predict(X_test)
evaluation_results = evaluation_metrics(y_true, y_pred)

models_final = models_final.append({'model_name': name, 
                        'model': ensem_clf, 
                        'parameters': ensem_clf.get_params()}, 
                       ignore_index=True)

models_test = models_test.append({'model_name': name, 
                                  'confusion_matrix' : evaluation_results[0], 
                                  'accuracy': evaluation_results[1], 
                                  'recall' : evaluation_results[2], 
                                  'f1_score': evaluation_results[3],
                                  'roc_auc_score': evaluation_results[4]}, 
                                 ignore_index=True)

In [None]:
# Save Model
pickle.dump(ensem_clf, open('ensem_clf.sav', 'wb')) 

# Load Model
# ensem_clf = pickle.load(open('ensem_clf.sav', 'rb'))

# Model Testing Result

### Export Model

In [None]:
# !pip install micromlgen

In [None]:
# from micromlgen import port

# print(port(rnd_clf))

### Results

In [None]:
display(models_final)

In [None]:
display(models_test)

In [None]:
print('Best Model By Accuracy')
print(models_test.loc[models_test['accuracy'] == max(models_test['accuracy'])].model_name.to_string(index=False))
print('-----------------------')
print('Best Model By Recall')
print(models_test.loc[models_test['recall'] == max(models_test['recall'])].model_name.to_string(index=False))
print('-----------------------')
print('Best Model By F1')
print(models_test.loc[models_test['f1_score'] == max(models_test['f1_score'])].model_name.to_string(index=False))
print('-----------------------')
print('Best Model By ROC')
print(models_test.loc[models_test['roc_auc_score'] == max(models_test['roc_auc_score'])].model_name.to_string(index=False))
print('-----------------------')

After model testing, the best model identified is xgboost_clf, with the following parameters:

* booster = 'gbtree',
* verbosity = 1,
* n_estimators = 750, 
* learning_rate = 0.01,
* max_depth = 10, 
* min_child_weight = 1,
* sampling_method = 'uniform',
* gamma = 0,
* random_state = 10

Full version:
{'base_score': 0.5, 'booster': 'gbtree', 'colsample_bylevel': 1, 'colsample_bynode': 1, 'colsample_bytree': 1, 'gamma': 0, 'learning_rate': 0.01, 'max_delta_step': 0, 'max_depth': 10, 'min_child_weight': 1, 'missing': None, 'n_estimators': 750, 'n_jobs': 1, 'nthread': None, 'objective': 'binary:logistic', 'random_state': 10, 'reg_alpha': 0, 'reg_lambda': 1, 'scale_pos_weight': 1, 'seed': None, 'silent': None, 'subsample': 1, 'verbosity': 1, 'sampling_method': 'uniform'}

### **Key Findings**
**General**
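* All models achieve fairly good accuracy and recall scores
* The F1 score, on the other hand, is quite poor for some
  * This is due to underfitting on an unbalanced dataset. Although SMOTE has been applied, some algorithms are still unable to capture the relationship between the input and output variables accurately, even with the synthetic data. The test set also remains imbalanced (131 fire cases out of 17,190), so even a modest false-positive rate can produce far more false positives than true positives, which lowers precision and therefore F1.
* ROC AUC is the least important evaluation metric here, since it averages over all possible decision thresholds; it is included only for reference. Note that evaluation_metrics computes it from hard 0/1 predictions, so the reported value corresponds to a single threshold.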
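Because evaluation_metrics passes hard 0/1 predictions to roc_auc_score, the ROC curve has only one interior point, and the area under it reduces to the mean of the true-positive rate and the true-negative rate (the balanced accuracy). The sketch below checks this in plain Python for the neigh_clf confusion matrix from the results table; both helper functions are hypothetical.

```python
def auc_from_hard_labels(tn, fp, fn, tp):
    """Area under the ROC curve through (0, 0), (FPR, TPR), (1, 1)."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    points = [(0.0, 0.0), (fpr, tpr), (1.0, 1.0)]
    area = 0.0
    # Trapezoid rule over the two segments of the curve
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

def balanced_accuracy(tn, fp, fn, tp):
    """Mean of recall on the positive and on the negative class."""
    return (tp / (tp + fn) + tn / (tn + fp)) / 2

# [tn, fp, fn, tp] for neigh_clf, taken from the results table
neigh_cfm = (16553, 506, 12, 119)
auc_value = auc_from_hard_labels(*neigh_cfm)
bal_acc = balanced_accuracy(*neigh_cfm)

print(f"{auc_value:.6f} {bal_acc:.6f}")
```

A threshold-independent AUC would require passing scores or probabilities (for example from predict_proba) instead of predicted labels.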

**Random Forest Classifier**
* Very quickly trained but performs slightly worse than xgboost_clf in all categories
* Slightly worse than tree_clf in recall and roc

**Decision Tree**
* Very quickly trained but performs slightly worse than xgboost_clf in all categories
* Slightly worse than rnd_clf in accuracy and f1

**Support Vector Machine**
* The SVM was dropped early in development due to its many requirements and poor performance
* It can only be trained on a smaller dataset, which requires undersampling to be applied
* Despite this, it is still unable to reach an accuracy above 80% and takes notoriously long to train
* Its best parameters were never fully identified, since the search would take too long for what is possibly the worst result of all models

**Logistic Regression**
* Not that computationally expensive, but generally poor performance relative to the other models, especially in f1_score

**Naive Bayes**
* Not that computationally expensive, but generally poor performance relative to the other models, especially in f1_score

**K-Nearest Neighbor**
* A slightly worse version of gboost_clf

**Gradient Boosting Classifier**
* Although gboost_clf and ann_clf obtain the best recall score, gboost_clf performs far worse in f1_score and is thus disqualified.

**XGBoost**
* xgboost_clf is the best performer in all metrics used except for recall, where it comes in 2nd
  * Upon further inspection, the gap amounts to a single additional false negative
  * Thus, this can be overlooked
* However, one major downside of xgboost_clf is that it takes significantly longer to train than other models that perform only slightly worse (rnd_clf and tree_clf)

**LightGBM**
* Like gboost_clf and ann_clf, it performs poorly in f1_score, and it is also worse than those two in recall

**Artificial Neural Network**
* Although gboost_clf and ann_clf obtain the best recall score, ann_clf performs far worse in f1_score and is thus disqualified.
* Little experimentation has been conducted on ann_clf so far because of the massive computational resources required. Note that ann_clf is already close to optimal in many of its parameters.

**Ensemble Learning**
* Ensemble learning did not improve on the results of xgboost_clf, especially in recall, and took a long time to train.

### **Conclusion:**
* Thus, xgboost_clf is the best-performing model.
* Personally, I would rate xgboost_clf > tree_clf = rnd_clf > ensem_clf > ann_clf = gboost_clf > lightgbm_clf >>> rest
* If we need to retrain a model quickly, either tree_clf or rnd_clf would be more suitable.