# Model Testing Pipeline

This notebook identifies the best model, with each candidate configured using its own best parameters:
1.   Re-train models using both training and validation data
2.   Evaluate models using testing data

In [2]:
# Basic Libraries

import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
from datetime import datetime

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# Data Source

df = pd.read_csv("/content/drive/MyDrive/Senior Project/Datasets/cleaned_gee_data.csv")
df = df.drop(columns=['Unnamed: 0'])  # drop the index column written out by an earlier to_csv call
df.head()

| | LATITUDE | LONGITUDE | ACQ_DATE | ACQ_TIME | OPEN_TIME | CLOSE_TIME | BRIGHTNESS | FIRE_OCCURRED | CO_MOL/M2 | SO2_MOL/M2 | NO2_MOL/M2 | O3_MOL/M2 | LOCATION | INSTRUMENT | DRY_SEASON |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -5.466232 | -0.176027 | -1.866392 | 0.634294 | 0.506405 | 0.526945 | -2.231078 | 0 | -0.024223 | -0.47444 | -1.152277 | -0.511001 | -1.159086 | 0 | 1 |
| 1 | -5.466232 | -0.176027 | -1.866392 | 0.634294 | 0.506405 | 0.526945 | -2.231078 | 0 | 0.113599 | -0.47444 | -1.152277 | -0.511001 | -1.159086 | 0 | 1 |
| 2 | -5.466232 | -0.176027 | -1.866392 | 0.634294 | 0.506405 | 0.526945 | -2.231078 | 0 | -0.024223 | -0.47444 | -1.361255 | -0.511001 | -1.159086 | 0 | 1 |
| 3 | -5.466232 | -0.176027 | -1.866392 | 0.634294 | 0.506405 | 0.526945 | -2.231078 | 0 | 0.113599 | -0.47444 | -1.361255 | -0.511001 | -1.159086 | 0 | 1 |
| 4 | -5.433352 | -0.197441 | -1.723773 | 0.634294 | 2.28608 | 1.793843 | -1.141613 | 0 | -0.967684 | 0.339667 | -1.25177 | 0.426114 | -1.159086 | 0 | 1 |


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 171893 entries, 0 to 171892
Data columns (total 15 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   LATITUDE       171893 non-null  float64
 1   LONGITUDE      171893 non-null  float64
 2   ACQ_DATE       171893 non-null  float64
 3   ACQ_TIME       171893 non-null  float64
 4   OPEN_TIME      171893 non-null  float64
 5   CLOSE_TIME     171893 non-null  float64
 6   BRIGHTNESS     171893 non-null  float64
 7   FIRE_OCCURRED  171893 non-null  int64  
 8   CO_MOL/M2      171893 non-null  float64
 9   SO2_MOL/M2     171893 non-null  float64
 10  NO2_MOL/M2     171893 non-null  float64
 11  O3_MOL/M2      171893 non-null  float64
 12  LOCATION       171893 non-null  float64
 13  INSTRUMENT     171893 non-null  int64  
 14  DRY_SEASON     171893 non-null  int64  
dtypes: float64(12), int64(3)
memory usage: 19.7 MB


In [5]:
display(df['FIRE_OCCURRED'].value_counts())

0    170544
1      1349
Name: FIRE_OCCURRED, dtype: int64

In [6]:
X = df.drop('FIRE_OCCURRED', axis=1)
y = df['FIRE_OCCURRED']

In [7]:
# Training, Testing Split

from sklearn.model_selection import train_test_split

# 90:10

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=10, shuffle=True)

Original = [X_train, X_test, y_train, y_test] # For reference

In [8]:
if len(X_train)==len(y_train) and len(X_test) == len(y_test):
  print("X and y data length matching")
else:
  print("Error in data preparation pipeline")
print()
print("No. of training data = %d" % len(X_train))
print("No. of testing data = %d" % len(X_test))

X and y data length matching

No. of training data = 154703
No. of testing data = 17190


In [9]:
display(y_test.value_counts())

0    17059
1      131
Name: FIRE_OCCURRED, dtype: int64

In [10]:
# SMOTE

from collections import Counter
from imblearn.over_sampling import SMOTE 

print('Original dataset shape %s' % Counter(y_train))
sm = SMOTE(random_state=10)
X_train, y_train = sm.fit_resample(X_train, y_train)
print('Resampled dataset shape %s' % Counter(y_train))

Original dataset shape Counter({0: 153485, 1: 1218})
Resampled dataset shape Counter({0: 153485, 1: 153485})
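The balanced counts above come from SMOTE, which creates synthetic minority samples instead of duplicating existing ones. As a rough illustration of the core idea only (this is not imblearn's implementation, which also performs the nearest-neighbour search), a synthetic point is placed at a random position on the line segment between a minority sample and one of its minority-class neighbours. The function name `interpolate_sample` and the toy vectors are hypothetical.

```python
import random

def interpolate_sample(x, neighbor, rng):
    """Return a point on the segment between x and neighbor (SMOTE-style)."""
    lam = rng.random()  # interpolation factor in [0, 1)
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]

rng = random.Random(10)
minority_point = [1.0, 2.0]     # toy minority-class sample (made up)
minority_neighbor = [3.0, 6.0]  # one of its minority-class neighbours (made up)

synthetic = interpolate_sample(minority_point, minority_neighbor, rng)
print(synthetic)
```

Every synthetic sample therefore stays inside the region already spanned by real minority samples, which is why SMOTE adds no genuinely new fire observations.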


In [11]:
# Shuffle Data

from sklearn.utils import shuffle

X_train, y_train = shuffle(X_train, y_train, random_state = 10)

In [12]:
# Evaluation Metrics

from sklearn.metrics import confusion_matrix, recall_score, f1_score, roc_auc_score, accuracy_score

def evaluation_metrics(y_true, y_pred):
  cfm = confusion_matrix(y_true, y_pred).ravel()
  acc = accuracy_score(y_true, y_pred)
  recs = recall_score(y_true, y_pred, average='binary')
  f1s = f1_score(y_true, y_pred, average='binary')
  rocs = roc_auc_score(y_true, y_pred, average='macro')
  return [cfm, acc, recs, f1s, rocs]

Confusion matrix format : [ tn , fp , fn , tp ]
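As a cross-check on how the reported metrics relate to this flat confusion matrix, the sketch below recomputes them by hand in plain Python. The function name `metrics_from_confusion` is hypothetical, and the code assumes no denominator is zero. The example input is the Naive Bayes confusion matrix reported later in this notebook.

```python
def metrics_from_confusion(cfm):
    """Derive the notebook's metrics from a flat [tn, fp, fn, tp] sequence."""
    tn, fp, fn, tp = cfm
    recall = tp / (tp + fn)       # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return {
        'accuracy': (tp + tn) / (tn + fp + fn + tp),
        'precision': tp / (tp + fp),
        'recall': recall,
        'f1_score': 2 * tp / (2 * tp + fp + fn),
        # With hard 0/1 predictions the ROC curve has a single interior
        # point, so its area is the mean of recall and specificity.
        'roc_auc_score': (recall + specificity) / 2,
    }

nb_metrics = metrics_from_confusion([13577, 3482, 17, 114])
for metric_name, value in nb_metrics.items():
    print(f"{metric_name}: {value:.6f}")
```

The precision value, which `evaluation_metrics` does not report, is what links a high recall to a low F1 score when false positives outnumber true positives.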

In [13]:
# Store Model Parameters and Eval

models_final = pd.DataFrame(columns = ['model_name', 'model', 'parameters'])
models_test = pd.DataFrame(columns = ['model_name', 'confusion_matrix', 'accuracy', 'recall', 'f1_score', 'roc_auc_score'])
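The cells below add rows with `DataFrame.append`, which was removed in pandas 2.0. One possible alternative, sketched here under the assumption that a newer pandas is in use, is to collect plain dictionaries in a list and build the table once at the end. The names `record_test_result`, `test_rows`, and `results_table` are hypothetical; the two example rows reuse results reported in this notebook.

```python
import pandas as pd

TEST_COLUMNS = ['model_name', 'confusion_matrix', 'accuracy',
                'recall', 'f1_score', 'roc_auc_score']

def record_test_result(rows, name, evaluation_results):
    """Append one model's metrics (in evaluation_metrics order) to rows."""
    cfm, acc, recs, f1s, rocs = evaluation_results
    rows.append({'model_name': name,
                 'confusion_matrix': cfm,
                 'accuracy': acc,
                 'recall': recs,
                 'f1_score': f1s,
                 'roc_auc_score': rocs})

test_rows = []  # one dict per evaluated model
record_test_result(test_rows, 'bayes_clf',
                   [[13577, 3482, 17, 114], 0.796451, 0.870229, 0.061175, 0.833057])
record_test_result(test_rows, 'rnd_clf',
                   [[17038, 21, 14, 117], 0.997964, 0.89313, 0.869888, 0.945949])

# Build the DataFrame once, after all models have been evaluated
results_table = pd.DataFrame(test_rows, columns=TEST_COLUMNS)
print(results_table)
```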

In [14]:
# Import ML Algorithms

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
import xgboost
from xgboost import XGBClassifier
import lightgbm
from lightgbm import LGBMClassifier
import tensorflow as tf
from tensorflow import keras
from sklearn.ensemble import VotingClassifier

# Save Model

import pickle

## Logistic Regression

- Library: Scikit-learn
- Around 30 seconds to train

**Best Parameters:**

{'warm_start': False,
 'solver': 'newton-cg',
 'penalty': 'none',
 'max_iter': 247,
 'dual': False,
 'C': 0}

In [None]:
name = 'log_clf'

log_clf = LogisticRegression(penalty = 'none', 
                             warm_start = False,
                             solver = 'newton-cg',
                             max_iter = 247,
                             dual = False,
                             C = 0, # has no effect: C is ignored when penalty = 'none'
                             n_jobs = -1, 
                             random_state = 10
                             ).fit(X_train,y_train)

y_true = y_test
y_pred = log_clf.predict(X_test)
evaluation_results = evaluation_metrics(y_true, y_pred)

models_final = models_final.append({'model_name': name, 
                        'model': log_clf, 
                        'parameters': log_clf.get_params()}, 
                        ignore_index=True)

models_test = models_test.append({'model_name': name, 
                                  'confusion_matrix' : evaluation_results[0], 
                                  'accuracy': evaluation_results[1], 
                                  'recall' : evaluation_results[2], 
                                  'f1_score': evaluation_results[3],
                                  'roc_auc_score': evaluation_results[4]}, 
                                 ignore_index=True)



In [None]:
# Save Model
pickle.dump(log_clf, open('log_clf.sav', 'wb')) 

# Load Model
# log_clf = pickle.load(open('log_clf.sav', 'rb'))
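The save and load pattern used throughout this notebook leaves the file handle open until it is garbage collected. A minimal sketch of the same round trip with context managers follows; the helper names `save_model` and `load_model` are hypothetical, and a plain dictionary stands in for a fitted estimator so that the example runs without scikit-learn.

```python
import os
import pickle
import tempfile

def save_model(model, path):
    """Pickle model to path, closing the file even if dumping fails."""
    with open(path, 'wb') as f:
        pickle.dump(model, f)

def load_model(path):
    """Load a pickled object from path."""
    with open(path, 'rb') as f:
        return pickle.load(f)

# Placeholder object, not a real estimator
stand_in_model = {'name': 'log_clf', 'max_iter': 247}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'log_clf.sav')
    save_model(stand_in_model, model_path)
    restored = load_model(model_path)

print(restored == stand_in_model)
```

Pickled models should only be loaded from trusted sources, and they generally need the same library versions that were used when saving.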

## Support Vector Machine (SVM)

- Library: Scikit-learn

- Dropped from consideration: performance was poor under every configuration tried.

**Best Parameters:**

In [None]:
X_train_SVM = Original[0]
X_test_SVM = Original[1]
y_train_SVM = Original[2]
y_test_SVM = Original[3]

In [None]:
y_train_SVM.value_counts()

In [None]:
# Undersampling

from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=10)
X_train_SVM, y_train_SVM = rus.fit_resample(X_train_SVM, y_train_SVM)

In [None]:
display(y_train_SVM.value_counts())

In [None]:
name = 'svc_clf'

svc_clf = SVC(kernel = 'rbf', 
              random_state = 10
              ).fit(X_train_SVM,y_train_SVM)

y_true = y_test_SVM
y_pred = svc_clf.predict(X_test_SVM)
evaluation_results = evaluation_metrics(y_true, y_pred)

models_final = models_final.append({'model_name': name, 
                        'model': svc_clf, 
                        'parameters': svc_clf.get_params()}, 
                       ignore_index=True)

models_test = models_test.append({'model_name': name, 
                                  'confusion_matrix' : evaluation_results[0], 
                                  'accuracy': evaluation_results[1], 
                                  'recall' : evaluation_results[2], 
                                  'f1_score': evaluation_results[3],
                                  'roc_auc_score': evaluation_results[4]}, 
                                 ignore_index=True)

In [None]:
# Save Model
pickle.dump(svc_clf, open('svc_clf.sav', 'wb')) 

# Load Model
# svc_clf = pickle.load(open('svc_clf.sav', 'rb'))

## Naive Bayes

- Library: Scikit-learn
- 1 minute to train

**Best Parameters:**

{'var_smoothing': 1e-3}

In [14]:
name = 'bayes_clf'

bayes_clf = GaussianNB(var_smoothing = 1e-3
                       ).fit(X_train,y_train)

y_true = y_test
y_pred = bayes_clf.predict(X_test)
evaluation_results = evaluation_metrics(y_true, y_pred)

models_final = models_final.append({'model_name': name, 
                        'model': bayes_clf, 
                        'parameters': bayes_clf.get_params()}, 
                       ignore_index=True)

models_test = models_test.append({'model_name': name, 
                                  'confusion_matrix' : evaluation_results[0], 
                                  'accuracy': evaluation_results[1], 
                                  'recall' : evaluation_results[2], 
                                  'f1_score': evaluation_results[3],
                                  'roc_auc_score': evaluation_results[4]}, 
                                 ignore_index=True)

In [16]:
display(models_final)

| | model_name | model | parameters |
|---|---|---|---|
| 0 | bayes_clf | GaussianNB(var_smoothing=0.001) | {'priors': None, 'var_smoothing': 0.001} |


In [17]:
display(models_test)

| | model_name | confusion_matrix | accuracy | recall | f1_score | roc_auc_score |
|---|---|---|---|---|---|---|
| 0 | bayes_clf | [13577, 3482, 17, 114] | 0.796451 | 0.870229 | 0.061175 | 0.833057 |


In [18]:
print(bayes_clf.to_arduino(instance_name='blobClassifier'))

AttributeError: ignored

In [None]:
# Save Model
pickle.dump(bayes_clf, open('bayes_clf.sav', 'wb')) 

# Load Model
# bayes_clf = pickle.load(open('bayes_clf.sav', 'rb'))

## K-Nearest Neighbor

- Library: Scikit-learn
- 1 minute 10 seconds to train

**Best Parameters:**

{'n_neighbors': 5, 'algorithm':'auto', 'leaf_size':30 }

In [None]:
name = 'neigh_clf'

neigh_clf = KNeighborsClassifier(n_neighbors = 5, 
                                 algorithm = 'auto',
                                 leaf_size = 30,
                                 p = 2,
                                 metric = 'minkowski',
                                 n_jobs = -1, 
                                 ).fit(X_train,y_train)

y_true = y_test
y_pred = neigh_clf.predict(X_test)
evaluation_results = evaluation_metrics(y_true, y_pred)

models_final = models_final.append({'model_name': name, 
                        'model': neigh_clf, 
                        'parameters': neigh_clf.get_params()}, 
                       ignore_index=True)

models_test = models_test.append({'model_name': name, 
                                  'confusion_matrix' : evaluation_results[0], 
                                  'accuracy': evaluation_results[1], 
                                  'recall' : evaluation_results[2], 
                                  'f1_score': evaluation_results[3],
                                  'roc_auc_score': evaluation_results[4]}, 
                                 ignore_index=True)

In [None]:
# Save Model
pickle.dump(neigh_clf, open('neigh_clf.sav', 'wb')) 

# Load Model
# neigh_clf = pickle.load(open('neigh_clf.sav', 'rb'))

## Decision Tree

- Library: Scikit-learn
- Very quick to train. Almost immediately trained.

**Best Parameters:**

{'splitter': 'best',
 'min_samples_leaf': 1,
 'max_features': 9,
 'max_depth': None,
 'criterion': 'gini'}


In [None]:
name = 'tree_clf'

tree_clf = DecisionTreeClassifier(criterion = 'gini', 
                                  splitter = 'best', 
                                  min_samples_leaf = 1,
                                  max_features = 9,
                                  max_depth = None,
                                  random_state = 10
                                  ).fit(X_train,y_train)

y_true = y_test
y_pred = tree_clf.predict(X_test)
evaluation_results = evaluation_metrics(y_true, y_pred)

models_final = models_final.append({'model_name': name, 
                        'model': tree_clf, 
                        'parameters': tree_clf.get_params()}, 
                       ignore_index=True)

models_test = models_test.append({'model_name': name, 
                                  'confusion_matrix' : evaluation_results[0], 
                                  'accuracy': evaluation_results[1], 
                                  'recall' : evaluation_results[2], 
                                  'f1_score': evaluation_results[3],
                                  'roc_auc_score': evaluation_results[4]}, 
                                 ignore_index=True)

In [None]:
# Save Model
pickle.dump(tree_clf, open('tree_clf.sav', 'wb')) 

# Load Model
# tree_clf = pickle.load(open('tree_clf.sav', 'rb'))

## Random Forest Classifier

- Library: Scikit-learn
- Very quick to train. Only 6 seconds needed.

**Best Parameters:**

{'n_estimators': 300,
 'min_samples_split': 2,
 'min_samples_leaf': 1,
 'max_features': 'auto',
 'max_depth': 31}

In [15]:
name = 'rnd_clf'

rnd_clf = RandomForestClassifier(n_estimators = 300, 
                                  min_samples_split = 2,
                                  min_samples_leaf = 1,
                                  max_features = 'auto',
                                  max_depth = 31, 
                                  n_jobs = -1, 
                                  random_state = 10
                                  ).fit(X_train,y_train)

y_true = y_test
y_pred = rnd_clf.predict(X_test)
evaluation_results = evaluation_metrics(y_true, y_pred)

models_final = models_final.append({'model_name': name, 
                        'model': rnd_clf, 
                        'parameters': rnd_clf.get_params()}, 
                       ignore_index=True)

models_test = models_test.append({'model_name': name, 
                                  'confusion_matrix' : evaluation_results[0], 
                                  'accuracy': evaluation_results[1], 
                                  'recall' : evaluation_results[2], 
                                  'f1_score': evaluation_results[3],
                                  'roc_auc_score': evaluation_results[4]}, 
                                 ignore_index=True)

In [16]:
display(models_final)

| | model_name | model | parameters |
|---|---|---|---|
| 0 | rnd_clf | (DecisionTreeClassifier(max_depth=31, max_feat... | {'bootstrap': True, 'ccp_alpha': 0.0, 'class_w... |


In [17]:
display(models_test)

| | model_name | confusion_matrix | accuracy | recall | f1_score | roc_auc_score |
|---|---|---|---|---|---|---|
| 0 | rnd_clf | [17038, 21, 14, 117] | 0.997964 | 0.89313 | 0.869888 | 0.945949 |


In [None]:
# Save Model
pickle.dump(rnd_clf, open('rnd_clf.sav', 'wb')) 

# Load Model
# rnd_clf = pickle.load(open('rnd_clf.sav', 'rb'))

## Gradient Boosting Classifier

- Library: Scikit-learn
- 1 minute 25 seconds to train

**Best Parameters:**

{'n_estimators': 500, 'learning_rate':0.5, 'max_depth':1, 'random_state':10}

In [None]:
name = 'gboost_clf'

gboost_clf = GradientBoostingClassifier(n_estimators = 500, 
                                        learning_rate = 0.5, 
                                        max_depth = 1,
                                        criterion = 'friedman_mse',
                                        random_state = 10
                                        ).fit(X_train,y_train)

y_true = y_test
y_pred = gboost_clf.predict(X_test)
evaluation_results = evaluation_metrics(y_true, y_pred)

models_final = models_final.append({'model_name': name, 
                        'model': gboost_clf, 
                        'parameters': gboost_clf.get_params()}, 
                       ignore_index=True)

models_test = models_test.append({'model_name': name, 
                                  'confusion_matrix' : evaluation_results[0], 
                                  'accuracy': evaluation_results[1], 
                                  'recall' : evaluation_results[2], 
                                  'f1_score': evaluation_results[3],
                                  'roc_auc_score': evaluation_results[4]}, 
                                 ignore_index=True)

In [None]:
# Save Model
pickle.dump(gboost_clf, open('gboost_clf.sav', 'wb')) 

# Load Model
# gboost_clf = pickle.load(open('gboost_clf.sav', 'rb'))

## XGBoost

- Library: xgboost
- ~6 minutes to train

**Best Parameters:**

{'n_estimators': 750, 'learning_rate':0.01, 'max_depth':10}

In [None]:
name = 'xgboost_clf'

xgboost_clf = XGBClassifier(booster = 'gbtree',
                            verbosity = 1,
                            n_estimators = 750, 
                            learning_rate = 0.01,
                            max_depth = 10, 
                            min_child_weight = 1,
                            # reg_alpha = 0, # L1 regularization term
                            # reg_lambda = 1, # L2 regularization term
                            sampling_method = 'uniform',
                            gamma = 0,
                            random_state = 10
                            ).fit(X_train,y_train)

y_true = y_test
y_pred = xgboost_clf.predict(X_test)
evaluation_results = evaluation_metrics(y_true, y_pred)

models_final = models_final.append({'model_name': name, 
                        'model': xgboost_clf, 
                        'parameters': xgboost_clf.get_params()}, 
                       ignore_index=True)

models_test = models_test.append({'model_name': name, 
                                  'confusion_matrix' : evaluation_results[0], 
                                  'accuracy': evaluation_results[1], 
                                  'recall' : evaluation_results[2], 
                                  'f1_score': evaluation_results[3],
                                  'roc_auc_score': evaluation_results[4]}, 
                                 ignore_index=True)

In [None]:
# Save Model
pickle.dump(xgboost_clf, open('xgboost_clf.sav', 'wb')) 

# Load Model
# xgboost_clf = pickle.load(open('xgboost_clf.sav', 'rb'))

## LightGBM

- Library: lightgbm
- 3 seconds to train

**Best Parameters:**

{'num_leaves': 500,
 'n_estimators': 100,
 'min_data_in_leaf': 100,
 'max_depth': 10,
 'learning_rate': 0.2}

In [None]:
name = 'lightgbm_clf'

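lightgbm_clf = LGBMClassifier(boosting_type = 'gbdt', # LightGBM's parameter name; 'booster' is the XGBoost spelling
                              n_estimators = 1000, 
                              num_iterations = 100, # LightGBM alias of n_estimators, so these two settings conflict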
                              learning_rate = 0.1, 
                              max_depth = 3,
                              num_leaves = 31,
                              tree_learner = 'serial',
                              random_state = 10
                              ).fit(X_train,y_train)

y_true = y_test
y_pred = lightgbm_clf.predict(X_test)
evaluation_results = evaluation_metrics(y_true, y_pred)

models_final = models_final.append({'model_name': name, 
                        'model': lightgbm_clf, 
                        'parameters': lightgbm_clf.get_params()}, 
                       ignore_index=True)

models_test = models_test.append({'model_name': name, 
                                  'confusion_matrix' : evaluation_results[0], 
                                  'accuracy': evaluation_results[1], 
                                  'recall' : evaluation_results[2], 
                                  'f1_score': evaluation_results[3],
                                  'roc_auc_score': evaluation_results[4]}, 
                                 ignore_index=True)

In [None]:
# Save Model
pickle.dump(lightgbm_clf, open('lightgbm_clf.sav', 'wb')) 

# Load Model
# lightgbm_clf = pickle.load(open('lightgbm_clf.sav', 'rb'))

## Artificial Neural Network

- Library: Keras, Tensorflow
- Approximately 30+ minutes to train

**Best Parameters:**

In [None]:
name = 'ann_clf'

ann_clf = keras.models.Sequential([
    keras.layers.Dense(17, input_shape=(X_train.shape[1],), activation='relu'), # 16 columns. 1 bias term to accelerate activation of a node.
    keras.layers.Dense(8, activation='relu'), # One hidden layer is sufficient for the large majority of problems. 
    # Set the number of neurons in the hidden layer as the mean of the neurons in the input and output layers.
    keras.layers.Dense(1, activation='sigmoid'), # Only 1 acceptable unless softmax activation function is used
])

ann_clf.summary()

In [None]:
# Shuffle Data since SMOTE appends many 1s at the end

from sklearn.utils import shuffle

X_train_ANN, y_train_ANN = shuffle(X_train, y_train, random_state = 10)

In [None]:
display(y_train_ANN.value_counts())

In [None]:
display(y_test.value_counts())

In [None]:
ann_clf.compile(optimizer = 'adam', 
                metrics=['accuracy'], 
                loss ='binary_crossentropy')

record = ann_clf.fit(
            X_train_ANN, 
            y_train_ANN, 
            validation_data = (X_test, y_test), 
            batch_size = 10, 
            epochs = 50)

In [None]:
y_true = y_test
prediction = ann_clf.predict(X_test)

# Convert the sigmoid outputs to class labels with a 0.5 threshold
y_pred = (prediction[:, 0] >= 0.5).astype(int)

evaluation_results = evaluation_metrics(y_true, y_pred)

models_final = models_final.append({'model_name': name, 
                        'model': ann_clf, 
                        'parameters': ann_clf.layers}, 
                       ignore_index=True)

models_test = models_test.append({'model_name': name, 
                                  'confusion_matrix' : evaluation_results[0], 
                                  'accuracy': evaluation_results[1], 
                                  'recall' : evaluation_results[2], 
                                  'f1_score': evaluation_results[3],
                                  'roc_auc_score': evaluation_results[4]}, 
                                 ignore_index=True)

In [None]:
# Save Model
ann_clf.save('ann_clf.h5') 

# Load Model
# ann_clf = tf.keras.models.load_model('ann_clf.h5')

## Voting Classifier
- Library: Scikit-learn, Keras, Tensorflow
- 7 minutes to train

**Best Parameters:**

{'voting': 'hard', 'n_jobs': -1}
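With `voting = 'hard'`, each base model casts one vote per sample and the majority class wins. The sketch below illustrates that rule in plain Python for three equally weighted models. It is an illustration of the idea rather than scikit-learn's implementation, and the function name `hard_vote` and the toy predictions are made up.

```python
from collections import Counter

def hard_vote(predictions_per_model):
    """Return the majority class per sample across equally weighted models."""
    voted = []
    for sample_votes in zip(*predictions_per_model):
        majority_class, _count = Counter(sample_votes).most_common(1)[0]
        voted.append(majority_class)
    return voted

# Toy predictions from three hypothetical models on four samples
m1 = [0, 1, 1, 0]
m2 = [0, 1, 0, 0]
m3 = [1, 1, 0, 0]

print(hard_vote([m1, m2, m3]))  # [0, 1, 0, 0]
```

With an odd number of binary classifiers, as in the three-model ensemble built below, a tie cannot occur.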

In [None]:
display(models_test)

Confusion matrix format : [ tn , fp , fn , tp ]

In [None]:
print('Best Model By Accuracy')
print(models_test.loc[models_test['accuracy'] == max(models_test['accuracy'])].model_name.to_string(index=False))
print('-----------------------')
print('Best Model By Recall')
print(models_test.loc[models_test['recall'] == max(models_test['recall'])].model_name.to_string(index=False))
print('-----------------------')
print('Best Model By F1')
print(models_test.loc[models_test['f1_score'] == max(models_test['f1_score'])].model_name.to_string(index=False))
print('-----------------------')
print('Best Model By ROC')
print(models_test.loc[models_test['roc_auc_score'] == max(models_test['roc_auc_score'])].model_name.to_string(index=False))
print('-----------------------')
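The four lookups above can also be written as one loop with `idxmax`. This is a sketch, not the notebook's own code: the table name `results` is hypothetical, and it holds only the two result rows reported earlier in this notebook. Unlike the equality filter above, `idxmax` returns only the first row when several models tie.

```python
import pandas as pd

results = pd.DataFrame({
    'model_name': ['bayes_clf', 'rnd_clf'],
    'accuracy': [0.796451, 0.997964],
    'recall': [0.870229, 0.893130],
    'f1_score': [0.061175, 0.869888],
    'roc_auc_score': [0.833057, 0.945949],
})

metric_columns = ['accuracy', 'recall', 'f1_score', 'roc_auc_score']

# Map each metric to the name of the model in the row holding its maximum
best_by_metric = {
    metric: results.loc[results[metric].idxmax(), 'model_name']
    for metric in metric_columns
}

for metric, model_name in best_by_metric.items():
    print(f"Best model by {metric}: {model_name}")
```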

In [None]:
name = 'ensem_clf'

ensem_clf = VotingClassifier(estimators=[('m1', xgboost_clf), ('m2', tree_clf), ('m3', rnd_clf)],
                             voting = 'hard',
                             n_jobs = -1, 
                             ).fit(X_train,y_train)

y_true = y_test
y_pred = ensem_clf.predict(X_test)
evaluation_results = evaluation_metrics(y_true, y_pred)

models_final = models_final.append({'model_name': name, 
                        'model': ensem_clf, 
                        'parameters': ensem_clf.get_params()}, 
                       ignore_index=True)

models_test = models_test.append({'model_name': name, 
                                  'confusion_matrix' : evaluation_results[0], 
                                  'accuracy': evaluation_results[1], 
                                  'recall' : evaluation_results[2], 
                                  'f1_score': evaluation_results[3],
                                  'roc_auc_score': evaluation_results[4]}, 
                                 ignore_index=True)

In [None]:
# Save Model
pickle.dump(ensem_clf, open('ensem_clf.sav', 'wb')) 

# Load Model
# ensem_clf = pickle.load(open('ensem_clf.sav', 'rb'))

## Discussion

In [None]:
display(models_final)

In [None]:
display(models_test)

In [None]:
print('Best Model By Accuracy')
print(models_test.loc[models_test['accuracy'] == max(models_test['accuracy'])].model_name.to_string(index=False))
print('-----------------------')
print('Best Model By Recall')
print(models_test.loc[models_test['recall'] == max(models_test['recall'])].model_name.to_string(index=False))
print('-----------------------')
print('Best Model By F1')
print(models_test.loc[models_test['f1_score'] == max(models_test['f1_score'])].model_name.to_string(index=False))
print('-----------------------')
print('Best Model By ROC')
print(models_test.loc[models_test['roc_auc_score'] == max(models_test['roc_auc_score'])].model_name.to_string(index=False))
print('-----------------------')

After model testing, the best model identified is xgboost_clf, with the following parameters:

* booster = 'gbtree',
* verbosity = 1,
* n_estimators = 750, 
* learning_rate = 0.01,
* max_depth = 10, 
* min_child_weight = 1,
* sampling_method = 'uniform',
* gamma = 0,
* random_state = 10

Full version:
{'base_score': 0.5, 'booster': 'gbtree', 'colsample_bylevel': 1, 'colsample_bynode': 1, 'colsample_bytree': 1, 'gamma': 0, 'learning_rate': 0.01, 'max_delta_step': 0, 'max_depth': 10, 'min_child_weight': 1, 'missing': None, 'n_estimators': 750, 'n_jobs': 1, 'nthread': None, 'objective': 'binary:logistic', 'random_state': 10, 'reg_alpha': 0, 'reg_lambda': 1, 'scale_pos_weight': 1, 'seed': None, 'silent': None, 'subsample': 1, 'verbosity': 1, 'sampling_method': 'uniform'}

### **Key Findings**
**General**
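* All models achieve fairly good accuracy and recall scores
* The F1 score, on the other hand, is quite poor for some models
  * This is attributed to underfitting on an imbalanced dataset. Although SMOTE was applied, some algorithms still fail to capture the relationship between the input and output variables accurately, even with the synthetic data. Because the test set contains only 131 fire cases against 17,059 non-fire cases, even a modest false positive rate drives precision, and therefore F1, down sharply.
* The ROC AUC score is the least important evaluation metric here and is used only for reference. Because `evaluation_metrics` computes it from hard class predictions rather than probabilities, it equals the mean of recall and specificity instead of averaging over all decision thresholds.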
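To make the effect of class imbalance on F1 concrete, the sketch below computes precision and F1 for a fixed recall at several false positive rates, using the test set's class counts (131 fire cases, 17,059 non-fire cases). The function name `precision_at` and the chosen rates are hypothetical illustration values, not results from any model in this notebook.

```python
def precision_at(recall, false_positive_rate, n_pos=131, n_neg=17059):
    """Precision implied by a recall and false positive rate at given class counts."""
    tp = recall * n_pos
    fp = false_positive_rate * n_neg
    return tp / (tp + fp)

assumed_recall = 0.87  # illustrative value, close to the recalls reported above

for fpr in (0.001, 0.01, 0.05, 0.20):
    precision = precision_at(assumed_recall, fpr)
    f1 = 2 * precision * assumed_recall / (precision + assumed_recall)
    print(f"FPR={fpr:.3f}  precision={precision:.3f}  F1={f1:.3f}")
```

Because non-fire cases outnumber fire cases by more than 100 to 1, each percentage point of false positive rate adds more false positives than there are fire cases in total.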

**Random Forest Classifier**
* Very quickly trained but performs slightly worse than xgboost_clf in all categories
* Slightly worse than tree_clf in recall and roc

**Decision Tree**
* Very quickly trained but performs slightly worse than xgboost_clf in all categories
* Slightly worse than rnd_clf in accuracy and f1

**Support Vector Machine**
* The SVM was dropped early in development because of its many requirements and poor performance
* It does not scale well to large training sets, so undersampling was applied to reduce the training data
* Despite this, it still could not reach an accuracy above 80% and took notoriously long to train
* Its best parameters were never identified, since the search would have taken too long for what was possibly the worst result of all models

**Logistic Regression**
* Not particularly computationally expensive, but performs poorly relative to the other models, especially in f1_score

**Naive Bayes**
* Not particularly computationally expensive, but performs poorly relative to the other models, especially in f1_score

**K-Nearest Neighbor**
* A slightly worse version of gboost_clf

**Gradient Boosting Classifier**
* Although gboost_clf and ann_clf obtain the best recall score, gboost_clf performs far worse in f1_score and is thus disqualified.

**XGBoost**
* xgboost_clf is the best performer in every metric except recall, where it comes second
  * On closer inspection, the gap amounts to a single additional false negative.
  * Thus, this can be overlooked
* However, one major downside of xgboost_clf is that it takes significantly longer to train than the models that perform only slightly worse (rnd_clf and tree_clf)

**LightGBM**
* Similar to gboost_clf and ann_clf with its poor performance in f1_score, but also worse than the other 2 in recall score

**Artificial Neural Network**
* Although gboost_clf and ann_clf obtain the best recall score, ann_clf performs far worse in f1_score and is thus disqualified.
* Few experiments have been conducted on ann_clf so far because of the massive computational resources required. Note that many of its parameters are already close to optimal.

**Ensemble Learning**
* Ensemble learning did not improve on the results of xgboost_clf, especially in recall, and took a long time to train.

### **Conclusion:**
* Thus, xgboost_clf is the best performing model.
* Personally, I would rate xgboost_clf > tree_clf = rnd_clf > ensem_clf > ann_clf = gboost_clf > lightgbm_clf >>> rest
* If we need to retrain a model quickly, either tree_clf or rnd_clf would be more suitable.