### Step 1: Import Required Libraries

If the following code doesn't run, then do 'pip install ipynb' in the command line. This code lets us import functions from notebooks in the lib folder. Lib has all of the feature extraction and model training/predicting functions.

In [20]:
import ipynb
import sys
sys.path.append('../lib/')

If the following code doesn't run, then do 'pip install imblearn' in the command line. This code lets us do SMOTE (synthetic minority oversampling technique) and random undersampling to help deal with the imbalanced data.

In [21]:
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

Here we import the remaining libraries that we'll need.

In [22]:
import pandas as pd
import numpy as np
import math
import os
import scipy.io
import pickle
import bz2
import time
import _pickle as cPickle
from sklearn.metrics import pairwise_distances, classification_report, confusion_matrix, roc_auc_score
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from xgboost.sklearn import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import cross_val_score

### Step 2: Set Work Directories

In [23]:
np.random.seed(2020)

Here we set the directories for the training set points and labels.

In [24]:
train_dir = '../data/train_set/'
train_image_dir = train_dir+"images/"
train_pt_dir = train_dir+"points/"
train_label_path = train_dir+"label.csv"

### Step 3: Set Up Controls

In this cell, we have a set of controls for the feature extraction. If true, then we process the features from scratch, and if false, then we load existing features from files in the output folder. 

+ (T/F) initial feature extraction on training set
+ (T/F) initial feature extraction on test set

+ (T/F) improved feature extraction on training set
+ (T/F) improved feature extraction on test set

+ (T/F) SMOTE using improved features on train set

+ (T/F) PCA using improved features on training set and test set (doesn't make sense to only do one from scratch and not the other)

In [25]:
run_feature_train_initial = True
run_feature_test_initial = True

run_feature_train = True 
run_feature_test = True 

run_feature_train_SMOTE = True

run_feature_PCA = True

In this cell, we have a set of controls for model training/testing. If true, then we train the model and generate predictions on the test set, and if false, then we skip that model. By default only the baseline and advanced models are set to run, but you can set the other models to be True to see how they perform.

In [26]:
run_baseline = True
run_advanced = True

run_baseline_improved = True
run_baseline_pca = True
run_knn = True
run_knn_smote = True
run_xgboost=True
feature_initial=True
run_random_forest=True
run_LDA=True
run_logistic=True
run_weighted_logistic=True
run_svm = True
run_svm_pca = True
run_weighted_svm = True
run_lasso = True
run_weighted_lasso = True
run_bagging_smote = True
run_naivebayes = True

### Step 4: Import Data and Train-Test Split

Here we import the data, and we can see that the dataset is imbalanced and that there are more records with basic emotions than records with complex emotions.

In [27]:
info = pd.read_csv(train_label_path)
n = info.shape[0]

#Data is imbalanced 
print('Number of records with label 0 (basic emotion):   {:4d} '.format(info.loc[info['label']==0].shape[0]))
print('Number of records with label 1 (complex emotion): {:2d} '.format(info.loc[info['label']==1].shape[0]))

Number of records with label 0 (basic emotion):   2402 
Number of records with label 1 (complex emotion): 598 


We do an 80-20 train-test split.

In [28]:
n_train = int(round(n*(4/5),0))
train_idx = np.random.choice(list(info.index),size=n_train,replace=False)
test_idx = list(set(list(info.index))-set(train_idx)) #set difference

Fiducial points are stored in matlab format. In this step, we read them and store them in a list.

In [30]:
#function to read fiducial points
#input: index
#output: matrix of fiducial points corresponding to the index

n_files = len(os.listdir(train_pt_dir))

def readMat_matrix(index):
    try:
        mat_data = scipy.io.loadmat(train_pt_dir+'{:04d}'.format(index)+'.mat')['faceCoordinatesUnwarped']
    except KeyError:
        mat_data = scipy.io.loadmat(train_pt_dir+'{:04d}'.format(index)+'.mat')['faceCoordinates2']
    return np.matrix.round(mat_data,0)

#load fiducial points into list and store them in output
fiducial_pt_list = list(map(readMat_matrix,list(range(1,n_files+1))))
pickle.dump(fiducial_pt_list, open( "../output/fiducial_pt_list.p", "wb" ) )

### Step 5: Construct Features and Responses

#### Starter Code Features

Use feature.ipynb's feature_initial function to generate pairwise distance features for the baseline model. This is the same feature extraction method as that of the starter code. Note that this method counts distances from x-axis and from y-axis separately between points.

Feature extraction times exclude the time it takes to write to an output file.

In [31]:
from ipynb.fs.full.feature import feature_initial

tm_feature_train_intitial = np.nan
if run_feature_train_initial == True:
    start = time.time()
    dat_train_initial = feature_initial(fiducial_pt_list, train_idx, info)
    end = time.time()
    tm_feature_train_initial = end-start
    with bz2.BZ2File('../output/train_data_initial' + '.pbz2', 'w') as f: 
        cPickle.dump(dat_train_initial, f)
    print('Initial feature extraction time for train: {:4f}'.format(tm_feature_train_initial))
else:
    dat_train_initial = cPickle.load(bz2.BZ2File('../output/train_data_initial.pbz2', 'rb'))
        
        
tm_feature_test_initial = np.nan
if run_feature_test_initial == True:
    start = time.time()
    dat_test_initial = feature_initial(fiducial_pt_list, test_idx, info)
    end = time.time()
    tm_feature_test_initial = end-start
    with bz2.BZ2File('../output/test_data_initial' + '.pbz2', 'w') as f: 
        cPickle.dump(dat_test_initial, f)
    print('Initial feature extraction time for test:  {:4f}'.format(tm_feature_test_initial))
else:
    dat_test_initial = cPickle.load(bz2.BZ2File('../output/test_data_initial.pbz2', 'rb'))

Initial feature extraction time for train: 5.857164
Initial feature extraction time for test:  1.419747


In [32]:
feature_train_initial = dat_train_initial.loc[:, dat_train_initial.columns != 'labels']
label_train_initial = dat_train_initial['labels']

feature_test_initial = dat_test_initial.loc[:, dat_test_initial.columns != 'labels']
label_test_initial = dat_test_initial['labels'] 

#### Improved Features

Use feature's feature_improved function to generate pairwise euclidean distance features to be used by all of the models other than the baseline. Since feature_improved just uses a single euclidean distance value rather than separate x-distance and y-distance values, feature_improved produces exactly half as many features as feature_initial while keeping the same information.

In [33]:
from ipynb.fs.full.feature import feature_improved

tm_feature_train_improved = np.nan
if run_feature_train == True:
    start = time.time()
    dat_train = feature_improved(fiducial_pt_list, train_idx, info)
    end = time.time()
    tm_feature_train_improved = end-start
    with bz2.BZ2File('../output/train_data' + '.pbz2', 'w') as f: 
        cPickle.dump(dat_train, f)
    print('Improved feature extraction time for train: {:4f}'.format(tm_feature_train_improved))
else:
    dat_train = cPickle.load(bz2.BZ2File('../output/train_data.pbz2', 'rb'))


tm_feature_test_improved = np.nan
if run_feature_test == True:
    start = time.time()
    dat_test = feature_improved(fiducial_pt_list, test_idx, info)
    end = time.time()
    tm_feature_test_improved = end-start
    with bz2.BZ2File('../output/test_data' + '.pbz2', 'w') as f: 
        cPickle.dump(dat_test, f)
    print('Improved feature extraction time for test:  {:4f}'.format(tm_feature_test_improved))
else:
    dat_test = cPickle.load(bz2.BZ2File('../output/test_data.pbz2', 'rb'))

Improved feature extraction time for train: 0.187968
Improved feature extraction time for test:  0.040353


In [34]:
feature_train = dat_train.loc[:, dat_train.columns != 'labels']
label_train = dat_train['labels'] 

feature_test = dat_test.loc[:, dat_test.columns != 'labels']
label_test = dat_test['labels']

#### SMOTE Features

Here we do the feature extraction for SMOTE which will be discussed more in the advanced model section. SMOTE is only done on the training data and not on the test data. SMOTE is a modification of the improved features. 

If the improved features are obtained from scratch, then we include the time it takes to get the improved features with the time it takes to use SMOTE. Otherwise, in the case where the improved features are loaded from the disk, we just use the time it takes to use SMOTE on the features.

In [35]:
from ipynb.fs.full.feature import feature_SMOTE

tm_feature_train_SMOTE = np.nan
if run_feature_train_SMOTE == True:
    start = time.time()
    dat_train_SMOTE = feature_SMOTE(dat_train)
    end = time.time()
    if pd.isnull(tm_feature_train_improved):
        tm_feature_train_SMOTE = end-start
    else:
        tm_feature_train_SMOTE = (end-start)+tm_feature_train_improved
    with bz2.BZ2File('../output/train_data_SMOTE' + '.pbz2', 'w') as f: 
        cPickle.dump(dat_train_SMOTE, f)
    print('SMOTE feature extraction time for train: {:4f}'.format(tm_feature_train_SMOTE))
else:
    dat_train_SMOTE = cPickle.load(bz2.BZ2File('../output/train_data_SMOTE.pbz2', 'rb'))

SMOTE feature extraction time for train: 2.498043


In [36]:
feature_train_sm = dat_train_SMOTE.loc[:,dat_train_SMOTE.columns!='labels']
label_train_sm = dat_train_SMOTE['labels']

After undersampling/oversampling, we now have equal number of members in each class (move this to feature section later)

In [37]:
print('Number of records with label 0 (basic emotion):   {:4d} '.format(len(label_train_sm)-sum(label_train_sm)))
print('Number of records with label 1 (complex emotion): {:2d} '.format(sum(label_train_sm)))

Number of records with label 0 (basic emotion):   1929 
Number of records with label 1 (complex emotion): 1929 


#### PCA Features

Finally, here we do PCA which is only done for the model candidates that were not chosen for the advanced model. PCA is done as a modification of the improved features. Also, it doesn't make sense to only do the PCA transformation on one of either the training data or test data.

In [38]:
from ipynb.fs.full.feature import feature_PCA

tm_feature_train_PCA = np.nan
tm_feature_test_PCA = np.nan

if run_feature_PCA == True:
    [dat_train_PCA, dat_test_PCA, tm_feature_train_PCA, tm_feature_test_PCA] = feature_PCA(dat_train,dat_test)
    
    if pd.isnull(tm_feature_train_improved)==False:
        tm_feature_train_PCA = tm_feature_train_PCA+tm_feature_train_improved
    if pd.isnull(tm_feature_test_improved)==False:
        tm_feature_test_PCA = tm_feature_test_PCA+tm_feature_test_improved
    
    with bz2.BZ2File('../output/train_data_PCA' + '.pbz2', 'w') as f: 
        cPickle.dump(dat_train_PCA, f)
    with bz2.BZ2File('../output/test_data_PCA' + '.pbz2', 'w') as f: 
        cPickle.dump(dat_test_PCA, f)
        
    print('PCA feature extraction time for train: {:4f}'.format(tm_feature_train_PCA))
    print('PCA feature extraction time for test:  {:4f}'.format(tm_feature_test_PCA))
        
else:
    dat_train_PCA = cPickle.load(bz2.BZ2File('../output/train_data_PCA.pbz2', 'rb'))
    dat_test_PCA = cPickle.load(bz2.BZ2File('../output/test_data_PCA.pbz2', 'rb'))
    

PCA feature extraction time for train: 7.268351
PCA feature extraction time for test:  0.118223


In [39]:
feature_train_PCA = dat_train_PCA.loc[:,dat_train_PCA.columns!='labels']
label_train_PCA = dat_train_PCA['labels'] #labels are same as label_train

feature_test_PCA = dat_test_PCA.loc[:,dat_test_PCA.columns!='labels']
label_test_PCA = dat_test_PCA['labels'] #labels are same as label_test

### Step 6: Baseline Model ~58% Balanced Accuracy

Before discussing the models, we will go over the two main metrics we used for finding a better model than the baseline. Since the data is imbalanced, we examined AUC and balanced accuracy rather than the regular accuracy metric.

AUC is the area under the ROC curve which measures the TP vs FP rate as the classification decision threshold changes over time.

Balanced accuracy is given by the formula $$balanced\_accuracy = \frac{1}{2}(\frac{TP}{TP+FN}+\frac{TN}{TN+FP}).$$ 


Balanced accuracy is used for imbalanced data as an estimate for the accuracy if the data was balanced, so the true performance of our models on balanced data will be close to the balanced accuracy.

In [40]:
#data frame used to store all of the model results
model_results_df = pd.DataFrame(columns=['Feature Extraction Train Time','Feature Extraction Test Time',
                                         'Train Time','Prediction Time','Accuracy','Balanced Accuracy','AUC'])

The baseline model is a GBM fitted on the initial pairwise fiducial features. The parameters were chosen from a grid search with AUC scoring. Feature extraction times for all models are already known from the previous step.

In [41]:
#grid search for optimal parameters
#params = {'learning_rate':[0.01,0.05,0.1,0.5], 'max_depth': [1,2,3], 'n_estimators':[50,100,150]}
#gscv = GridSearchCV(GradientBoostingClassifier(),params,cv=3,scoring='roc_auc').fit(feature_train_initial,label_train_initial)
#gscv.best_params_
# output: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 150}

In [42]:
from ipynb.fs.full.train_gbm import train_gbm
from ipynb.fs.full.test_model import test_model
from ipynb.fs.full.compute_metrics import compute_metrics

if run_baseline == True:
    
    print('Feature extraction time for train: {:4f} seconds'.format(tm_feature_train_initial))
    print('Feature extraction time for test: {:4f}  seconds'.format(tm_feature_test_initial))
    
    [train_time,baseline] = train_gbm(feature_train_initial,label_train_initial,
                                      learning_rate=0.1,max_depth=3,n_estimators=150)
    print('\nTraining time: {:4f} seconds'.format(train_time))
    
    [prediction_time,test_preds] = test_model(baseline,feature_test_initial)
    print('Prediction time: {:4f} seconds'.format(prediction_time))
    
    [accuracy, balanced_accuracy, auc] = compute_metrics(feature_test_initial,label_test_initial,test_preds,baseline)
    print('\nAccuracy: {:4f}'.format(accuracy))
    print('Balanced Accuracy: {:4f}'.format(balanced_accuracy))
    print('AUC: {:4f}'.format(auc))
    
    #save baseline model
    pickle.dump(baseline,open("../output/baseline.p", "wb"))
    
    row = pd.Series({'Feature Extraction Train Time':tm_feature_train_initial,
                    'Feature Extraction Test Time':tm_feature_test_initial,
                     'Train Time':train_time,
                     'Prediction Time':prediction_time,
                    'Accuracy':accuracy,
                    'AUC':auc,
                    'Balanced Accuracy':balanced_accuracy},name='Baseline')
    model_results_df = model_results_df.append(row)

Feature extraction time for train: 5.857164 seconds
Feature extraction time for test: 1.419747  seconds

Training time: 187.676153 seconds
Prediction time: 0.059905 seconds

Accuracy: 0.808333
Balanced Accuracy: 0.587563
AUC: 0.785287


### Step 7: Advanced Model (SMOTEBoost) ~70% Balanced Accuracy

For the advanced model, we decided to use SMOTEBoost, which is a modified version of XGBoost that uses SMOTE (Synethic Minority Oversampling Technique) (add reference). Our model also uses the improved features which do not double count distances, and the parameters were chosen from grid search with AUC scoring.

The idea of SMOTE is to modify the imbalanced training data by first randomly undersampling from the majority class and then creating new synthetic minority data that is close to the existing feature space. The modified SMOTE features then have equal number of data in each class.

We went with this model for a couple of reasons. First of all, it addresses the fact that the training data is imbalanced. It also has a higher AUC, accuracy, and balanced accuracy than the baseline GBM model. Finally, compared to the other candidates for the advanced model, it has the highest AUC from 10 fold cross validation.

In [44]:
from ipynb.fs.full.train_xgb import train_xgb
from ipynb.fs.full.test_model import test_model
from ipynb.fs.full.compute_metrics import compute_metrics

if run_advanced == True:
    
    print('Feature extraction time for train: {:4f} seconds'.format(tm_feature_train_SMOTE))
    print('Feature extraction time for test:  {:4f}  seconds'.format(tm_feature_test_improved))
    
    [train_time, advanced] = train_xgb(feature_train_sm, label_train_sm, learning_rate=0.25, n_estimators=300,
                                      max_depth=3,min_child_weight=1,objective='binary:logistic',scale_pos_weight=4)
    print('\nTraining time: {:4f} seconds'.format(train_time))

         
    [prediction_time,test_preds] = test_model(advanced,feature_test)
    print('Prediction time: {:4f} seconds'.format(prediction_time))
    
    [accuracy, balanced_accuracy, auc] = compute_metrics(feature_test,label_test,test_preds,advanced)
    balanced_accuracy = balanced_accuracy_score(label_test,test_preds)

    print('\nAccuracy: {:4f}'.format(accuracy))
    print('Balanced Accuracy: {:4f}'.format(balanced_accuracy))
    print('AUC: {:4f}'.format(auc))
    
    pickle.dump(advanced,open("../output/advanced.p", "wb"))
    
    row = pd.Series({'Feature Extraction Train Time':tm_feature_train_SMOTE,
                    'Feature Extraction Test Time':tm_feature_test_improved,
                     'Train Time':train_time,
                     'Prediction Time':prediction_time,
                    'Accuracy':accuracy,
                    'AUC':auc,
                    'Balanced Accuracy':balanced_accuracy},name='Advanced (SMOTEBoost)')
    model_results_df.loc['Advanced (SMOTEBoost)'] = row    

Feature extraction time for train: 2.498043 seconds
Feature extraction time for test:  0.040353  seconds

Training time: 28.193406 seconds
Prediction time: 0.105752 seconds

Accuracy: 0.810000
Balanced Accuracy: 0.706697
AUC: 0.827371


### Optional Step: Remaining Models

These are the other models that were candidates for the advanced model. 
(To Do: Make them not run by default and have a table stored as a pickle file to load up here)

It takes about 30 minutes for the remaining models to finish if they are all set to run.

#### Baseline Model with Improved Features

In [23]:
#grid search for optimal parameters
#params = {'learning_rate':[0.01,0.05,0.1,0.5], 'max_depth': [1,2,3], 'n_estimators':[50,100,150]}
#gscv = GridSearchCV(GradientBoostingClassifier(),params,cv=3,scoring='roc_auc').fit(feature_train,label_train)
#gscv.best_params_
# output: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 150}

In [24]:
from ipynb.fs.full.train_xgb import train_xgb

if run_baseline_improved == True:
    
    print('Feature extraction time for train: {:4f} seconds'.format(tm_feature_train_improved))
    print('Feature extraction time for test:  {:4f}  seconds'.format(tm_feature_test_improved))
    
    [train_time, baseline_improved] = train_gbm(feature_train,label_train,
                                                learning_rate=0.1,max_depth=3,n_estimators=150)
    
    print('\nTraining time: {:4f} seconds'.format(train_time))
    
    [prediction_time,test_preds] = test_model(baseline_improved,feature_test)
    print('Prediction time: {:4f} seconds'.format(prediction_time))
    
    [accuracy, balanced_accuracy, auc] = compute_metrics(feature_test_initial,label_test_initial,test_preds,baseline)
    print('\nAccuracy: {:4f}'.format(accuracy))
    print('Balanced Accuracy: {:4f}'.format(balanced_accuracy))
    print('AUC: {:4f}'.format(auc))

    row = pd.Series({'Feature Extraction Train Time':tm_feature_train_improved,
                'Feature Extraction Test Time':tm_feature_test_improved,
                 'Train Time':train_time,
                 'Prediction Time':prediction_time,
                'Accuracy':accuracy,
                'AUC':auc,
                'Balanced Accuracy':balanced_accuracy},name='Baseline with Improved Features')
    model_results_df = model_results_df.append(row)

Feature extraction time for train: 0.193292 seconds
Feature extraction time for test:  0.038007  seconds

Training time: 192.445398 seconds
Prediction time: 0.032289 seconds

Accuracy: 0.816667
Balanced Accuracy: 0.610128
AUC: 0.785287


### Baseline Model With PCA

In [25]:
#params = {'learning_rate':[0.01,0.05,0.1,0.5], 'max_depth': [1,2,3], 'n_estimators':[50,100,150]}
#gscv = GridSearchCV(GradientBoostingClassifier(),params,cv=3).fit(feature_train_PCA,label_train)
#gscv.best_params_
#output: {'learning_rate': 0.1, 'max_depth': 2, 'n_estimators': 150}

In [26]:
if run_baseline_pca == True:

    print('Feature extraction time for train: {:4f} seconds'.format(tm_feature_train_PCA))
    print('Feature extraction time for test:  {:4f}  seconds'.format(tm_feature_test_PCA))
    
    [train_time, baseline_PCA] = train_gbm(feature_train_PCA,label_train_PCA,
                                                learning_rate=0.1,max_depth=2,n_estimators=150)
    
    print('\nTraining time: {:4f} seconds'.format(train_time))
    
    [prediction_time,test_preds] = test_model(baseline_PCA,feature_test_PCA)
    print('Prediction time: {:4f} seconds'.format(prediction_time))
    
    [accuracy, balanced_accuracy, auc] = compute_metrics(feature_test_PCA,label_test_PCA,test_preds,baseline_PCA)
    print('\nAccuracy: {:4f}'.format(accuracy))
    print('Balanced Accuracy: {:4f}'.format(balanced_accuracy))
    print('AUC: {:4f}'.format(auc))
    
    row = pd.Series({'Feature Extraction Train Time':tm_feature_train_PCA,
            'Feature Extraction Test Time':tm_feature_test_PCA,
             'Train Time':train_time,
             'Prediction Time':prediction_time,
            'Accuracy':accuracy,
            'AUC':auc,
            'Balanced Accuracy':balanced_accuracy},name='Baseline with PCA')
    model_results_df = model_results_df.append(row)

Feature extraction time for train: 6.889987 seconds
Feature extraction time for test:  0.119862  seconds

Training time: 1.294838 seconds
Prediction time: 0.002415 seconds

Accuracy: 0.790000
Balanced Accuracy: 0.535616
AUC: 0.692814


### KNN Model

In [27]:
#params = {'n_neighbors':list(range(5,55,5))}
#gscv = GridSearchCV(KNeighborsClassifier(),params,cv=5).fit(feature_train,label_train)
#gscv.best_params_
#output: {'n_neighbors': 25}

In [28]:
from ipynb.fs.full.train_knn import train_knn

if run_knn == True:
    
    print('Feature extraction time for train: {:4f} seconds'.format(tm_feature_train_improved))
    print('Feature extraction time for test:  {:4f}  seconds'.format(tm_feature_test_improved))
    
    [train_time, knn] = train_knn(feature_train,label_train,n_neighbors=25)
    print('\nTraining time: {:4f} seconds'.format(train_time))
    
    [prediction_time,test_preds] = test_model(knn,feature_test)
    print('Prediction time: {:4f} seconds'.format(prediction_time))
    
    [accuracy, balanced_accuracy, auc] = compute_metrics(feature_test,label_test,test_preds,knn)
    print('\nAccuracy: {:4f}'.format(accuracy))
    print('Balanced Accuracy: {:4f}'.format(balanced_accuracy))
    print('AUC: {:4f}'.format(auc))
    
    row = pd.Series({'Feature Extraction Train Time':tm_feature_train_improved,
        'Feature Extraction Test Time':tm_feature_test_improved,
         'Train Time':train_time,
         'Prediction Time':prediction_time,
        'Accuracy':accuracy,
        'AUC':auc,
        'Balanced Accuracy':balanced_accuracy},name='KNN')
    model_results_df = model_results_df.append(row)

Feature extraction time for train: 0.193292 seconds
Feature extraction time for test:  0.038007  seconds

Training time: 0.381865 seconds
Prediction time: 6.070336 seconds

Accuracy: 0.791667
Balanced Accuracy: 0.516514
AUC: 0.674527


### SMOTE KNN

In [29]:
#params = {'n_neighbors':list(range(5,55,5))}
#gscv = GridSearchCV(KNeighborsClassifier(),params,cv=5).fit(feature_train_sm,label_train_sm)
#gscv.best_params_
#output: {'n_neighbors': 5}

In [30]:
if run_knn_smote == True:
    
    print('Feature extraction time for train: {:4f} seconds'.format(tm_feature_train_SMOTE))
    print('Feature extraction time for test:  {:4f}  seconds'.format(tm_feature_test_improved))
    
    [train_time, knn] = train_knn(feature_train_sm,label_train_sm,n_neighbors=5)
    print('\nTraining time: {:4f} seconds'.format(train_time))
    
    [prediction_time,test_preds] = test_model(knn,feature_test)
    print('Prediction time: {:4f} seconds'.format(prediction_time))
    
    [accuracy, balanced_accuracy, auc] = compute_metrics(feature_test,label_test,test_preds,knn)
    balanced_accuracy = balanced_accuracy_score(label_test,test_preds)

    print('\nAccuracy: {:4f}'.format(accuracy))
    print('Balanced Accuracy: {:4f}'.format(balanced_accuracy))
    print('AUC: {:4f}'.format(auc))
    
    row = pd.Series({'Feature Extraction Train Time':tm_feature_train_SMOTE,
        'Feature Extraction Test Time':tm_feature_test_improved,
         'Train Time':train_time,
         'Prediction Time':prediction_time,
        'Accuracy':accuracy,
        'AUC':auc,
        'Balanced Accuracy':balanced_accuracy},name='SMOTE KNN')
    model_results_df = model_results_df.append(row)

Feature extraction time for train: 2.512055 seconds
Feature extraction time for test:  0.038007  seconds

Training time: 0.762404 seconds
Prediction time: 8.466854 seconds

Accuracy: 0.606667
Balanced Accuracy: 0.638211
AUC: 0.683691


### XGBoost Model

In [31]:
if run_xgboost == True:
    
    print('Feature extraction time for train: {:4f} seconds'.format(tm_feature_train_improved))
    print('Feature extraction time for test:  {:4f} seconds'.format(tm_feature_test_improved))
    
    [train_time, xgb] = train_xgb(feature_train, label_train, learning_rate=0.1, n_estimators=200,
                                      max_depth=3,min_child_weight=1,objective='binary:logistic',scale_pos_weight=4)
    print('\nTraining time: {:4f} seconds'.format(train_time))

         
    [prediction_time,test_preds] = test_model(xgb,feature_test)
    print('Prediction time: {:4f} seconds'.format(prediction_time))

    [accuracy, balanced_accuracy, auc] = compute_metrics(feature_test,label_test,test_preds,xgb)
    print('\nAccuracy: {:4f}'.format(accuracy))
    print('Balanced Accuracy: {:4f}'.format(balanced_accuracy))
    print('AUC: {:4f}'.format(auc))
    
    row = pd.Series({'Feature Extraction Train Time':tm_feature_train_improved,
        'Feature Extraction Test Time':tm_feature_test_improved,
         'Train Time':train_time,
         'Prediction Time':prediction_time,
        'Accuracy':accuracy,
        'AUC':auc,
        'Balanced Accuracy':balanced_accuracy},name='XGBoost')
    model_results_df = model_results_df.append(row)

Feature extraction time for train: 0.193292 seconds
Feature extraction time for test:  0.038007 seconds

Training time: 89.336492 seconds
Prediction time: 0.087405 seconds

Accuracy: 0.820000
Balanced Accuracy: 0.620882
AUC: 0.809126


### Random Forest Model

In [32]:
from ipynb.fs.full.train_random_forest import train_random_forest

if run_random_forest==True:
    
    print('Feature extraction time for train: {:4f} seconds'.format(tm_feature_train_improved))
    print('Feature extraction time for test: {:4f}  seconds'.format(tm_feature_test_improved))
    
    [train_time, rf_model] = train_random_forest(feature_train,label_train,n_estimators=100,criterion='gini',min_samples_leaf=1,max_features='sqrt')
    print('\nTraining time: {:4f} seconds'.format(train_time))
    
    [prediction_time,test_preds] = test_model(rf_model,feature_test)
    print('Prediction time: {:4f} seconds'.format(prediction_time))
    
    [accuracy, balanced_accuracy, auc] = compute_metrics(feature_test,label_test,test_preds,rf_model)
    print('\nAccuracy: {:4f}'.format(accuracy))
    print('Balanced Accuracy: {:4f}'.format(balanced_accuracy))
    print('AUC: {:4f}'.format(auc))
    
    row = pd.Series({'Feature Extraction Train Time':tm_feature_train_improved,
        'Feature Extraction Test Time':tm_feature_test_improved,
         'Train Time':train_time,
         'Prediction Time':prediction_time,
        'Accuracy':accuracy,
        'AUC':auc,
        'Balanced Accuracy':balanced_accuracy},name='Random Forest')
    model_results_df = model_results_df.append(row)

Feature extraction time for train: 0.193292 seconds
Feature extraction time for test: 0.038007  seconds

Training time: 7.629287 seconds
Prediction time: 0.039910 seconds

Accuracy: 0.810000
Balanced Accuracy: 0.562701
AUC: 0.766485


### LDA Model

In [33]:
from ipynb.fs.full.train_lda import train_lda

if run_LDA==True: 
    
    print('Feature extraction time for train: {:4f} seconds'.format(tm_feature_train_improved))
    print('Feature extraction time for test: {:4f}  seconds'.format(tm_feature_test_improved))
    
    [train_time, lda_model] = train_lda(feature_train, label_train,solver='eigen', shrinkage=.1, n_components=1)
    print('\nTraining time: {:4f} seconds'.format(train_time))
    
    [prediction_time,test_preds] = test_model(lda_model,feature_test)
    print('Prediction time: {:4f} seconds'.format(prediction_time))
    
    [accuracy, balanced_accuracy, auc] = compute_metrics(feature_test,label_test,test_preds,lda_model)
    print('\nAccuracy: {:4f}'.format(accuracy))
    print('Balanced Accuracy: {:4f}'.format(balanced_accuracy))
    print('AUC: {:4f}'.format(auc))
    
    row = pd.Series({'Feature Extraction Train Time':tm_feature_train_improved,
        'Feature Extraction Test Time':tm_feature_test_improved,
         'Train Time':train_time,
         'Prediction Time':prediction_time,
        'Accuracy':accuracy,
        'AUC':auc,
        'Balanced Accuracy':balanced_accuracy},name='LDA')
    model_results_df = model_results_df.append(row)

Feature extraction time for train: 0.193292 seconds
Feature extraction time for test: 0.038007  seconds

Training time: 7.088265 seconds
Prediction time: 0.018853 seconds

Accuracy: 0.821667
Balanced Accuracy: 0.636339
AUC: 0.786203


### Logistic Model

In [34]:
#grid={"C":[0.001,0.01,0.1,0.25,0.5,1,10]}
#cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=3, random_state=1)
#gscv = GridSearchCV(LogisticRegression(dual=False, fit_intercept=True,
#                   intercept_scaling=1, max_iter=1200000,
#                   multi_class='multinomial', penalty='l2',
#                   solver='lbfgs', tol=0.0001),grid,cv=cv,return_train_score=True)
#gscv.fit(feature_train,label_train)
#gscv.best_params_
#output: {'C': 0.01}

In [35]:
from ipynb.fs.full.train_logistic import train_logistic

if run_logistic==True:
    
    print('Feature extraction time for train: {:4f} seconds'.format(tm_feature_train_improved))
    print('Feature extraction time for test: {:4f}  seconds'.format(tm_feature_test_improved))
    
    weights = {0:1.0, 1:1.0}
    [train_time,lr_model] = train_logistic(feature_train,label_train,
                                           C=0.01, dual=False, fit_intercept=True,
                                           intercept_scaling=1, max_iter=1200000,
                                           multi_class='multinomial', penalty='l2',
                                           solver='lbfgs', tol=0.0001,class_weight=weights)
    
    print('\nTraining time: {:4f} seconds'.format(train_time))
    
    [prediction_time,test_preds] = test_model(lr_model,feature_test)
    print('Prediction time: {:4f} seconds'.format(prediction_time))
    
    [accuracy, balanced_accuracy, auc] = compute_metrics(feature_test,label_test,test_preds,lr_model)
    print('\nAccuracy: {:4f}'.format(accuracy))
    print('Balanced Accuracy: {:4f}'.format(balanced_accuracy))
    print('AUC: {:4f}'.format(auc))
    
    row = pd.Series({'Feature Extraction Train Time':tm_feature_train_improved,
        'Feature Extraction Test Time':tm_feature_test_improved,
         'Train Time':train_time,
         'Prediction Time':prediction_time,
        'Accuracy':accuracy,
        'AUC':auc,
        'Balanced Accuracy':balanced_accuracy},name='Logistic Regression')
    model_results_df = model_results_df.append(row)

Feature extraction time for train: 0.193292 seconds
Feature extraction time for test: 0.038007  seconds

Training time: 119.274133 seconds
Prediction time: 0.024061 seconds

Accuracy: 0.821667
Balanced Accuracy: 0.699697
AUC: 0.822310


### Weighted Logistic Model

In [36]:
#grid={"C":[0.001,0.01,0.1,0.25,0.5,1,10]}
#weights = {0:80.0, 1:20.0}
#cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=3, random_state=1)
#gscv = GridSearchCV(LogisticRegression(dual=False, fit_intercept=True,
#                   intercept_scaling=1, max_iter=1200000,
#                   multi_class='multinomial', penalty='l2',
#                   solver='lbfgs', tol=0.0001,class_weight=weights),grid,cv=cv,return_train_score=True)
#gscv.fit(feature_train,label_train)
#gscv.best_params_
#output: {'C': 0.001}

In [37]:
from ipynb.fs.full.train_logistic import train_logistic
if run_weighted_logistic==True:
    
    print('Feature extraction time for train: {:4f} seconds'.format(tm_feature_train_improved))
    print('Feature extraction time for test: {:4f}  seconds'.format(tm_feature_test_improved))
    
    weights = {0:80.0, 1:20.0}
    [train_time,lr_model] = train_logistic(feature_train,label_train,
                                           C=0.001, dual=False, fit_intercept=True,
                                           intercept_scaling=1, max_iter=120000000000,
                                           multi_class='multinomial', penalty='l2',
                                           solver='lbfgs', tol=0.0001,class_weight=weights)
    print('\nTraining time: {:4f} seconds'.format(train_time))
    
    [prediction_time,test_preds] = test_model(lr_model,feature_test)
    print('Prediction time: {:4f} seconds'.format(prediction_time))
    
    [accuracy, balanced_accuracy, auc] = compute_metrics(feature_test,label_test,test_preds,lr_model)
    print('\nAccuracy: {:4f}'.format(accuracy))
    print('Balanced Accuracy: {:4f}'.format(balanced_accuracy))
    print('AUC: {:4f}'.format(auc))
    
    row = pd.Series({'Feature Extraction Train Time':tm_feature_train_improved,
        'Feature Extraction Test Time':tm_feature_test_improved,
         'Train Time':train_time,
         'Prediction Time':prediction_time,
        'Accuracy':accuracy,
        'AUC':auc,
        'Balanced Accuracy':balanced_accuracy},name='Weighted Logistic')
    model_results_df = model_results_df.append(row)

Feature extraction time for train: 0.193292 seconds
Feature extraction time for test: 0.038007  seconds

Training time: 145.245618 seconds
Prediction time: 0.023004 seconds

Accuracy: 0.830000
Balanced Accuracy: 0.679063
AUC: 0.822843


STOP: TOTAL NO. of f AND g EVALUATIONS EXCEEDS LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


### SVM

In [38]:
# #grid search with cv 3 to find the best performed parameters
# param= {'C': [0.00001,0.0001,0.001,0.01,1,10],
#        'kernel':['linear', 'rbf', 'poly'],
#        'degree':[2,3,4]}

# gscv = GridSearchCV(SVC(random_state = 2020), param, cv=3, return_train_score=True)
# gscv.fit(feature_train,label_train)
# gscv.best_params_
# #output: {'C': 10, 'degree': 4, 'kernel': 'poly'}

In [39]:
from ipynb.fs.full.train_svm import train_svm

#improved svm using parameters from grid search
if run_svm==True:
    
    print('Feature extraction time for train: {:4f} seconds'.format(tm_feature_train_improved))
    print('Feature extraction time for test: {:4f}  seconds'.format(tm_feature_test_improved))
    
    weights = {0:1.0, 1:1.0}
    [train_time,svm_model] = train_svm(feature_train,label_train,
                                      C=10, kernel='poly', degree=4,
                                      probability=True,class_weight=weights)
    
    print('\nTraining time: {:4f} seconds'.format(train_time))
    
    [prediction_time,test_preds] = test_model(svm_model,feature_test)
    print('Prediction time: {:4f} seconds'.format(prediction_time))
    
    [accuracy, balanced_accuracy, auc] = compute_metrics(feature_test,label_test,test_preds,svm_model)
    print('\nAccuracy: {:4f}'.format(accuracy))
    print('Balanced Accuracy: {:4f}'.format(balanced_accuracy))
    print('AUC: {:4f}'.format(auc))
    
    row = pd.Series({'Feature Extraction Train Time':tm_feature_train_improved,
        'Feature Extraction Test Time':tm_feature_test_improved,
         'Train Time':train_time,
         'Prediction Time':prediction_time,
        'Accuracy':accuracy,
        'AUC':auc,
        'Balanced Accuracy':balanced_accuracy},name='SVM')
    model_results_df = model_results_df.append(row)

Feature extraction time for train: 0.193292 seconds
Feature extraction time for test: 0.038007  seconds

Training time: 85.878458 seconds
Prediction time: 1.919353 seconds

Accuracy: 0.855000
Balanced Accuracy: 0.683400
AUC: 0.828070


### SVM With PCA

In [40]:
# #grid search with cv 3 to find the best performed parameters
# param= {'C': [0.001,0.01,1,10,15,20],
#        'kernel':['linear', 'rbf', 'poly'],
#        'degree':[2,3,4]}

# gscv = GridSearchCV(SVC(random_state = 2020), param, cv=3, return_train_score=True)
# gscv.fit(feature_train_PCA,label_train)
# gscv.best_params_
# #output: {'C': 10, 'degree': 2, 'kernel': 'rbf'}

In [41]:
#improved svm with PCA
from ipynb.fs.full.train_svm import train_svm

if run_svm==True:    
    
    print('Feature extraction time for train: {:4f} seconds'.format(tm_feature_train_PCA))
    print('Feature extraction time for test:  {:4f}  seconds'.format(tm_feature_test_PCA))
    
    weights = {0:1.0, 1:1.0}
    [train_time, svm_PCA] = train_svm(feature_train_PCA,label_train_PCA,
                                      C=10,degree=2,kernel='rbf',probability=True,
                                     class_weight=weights)
    
    print('\nTraining time: {:4f} seconds'.format(train_time))
    
    [prediction_time,test_preds] = test_model(svm_PCA,feature_test_PCA)
    print('Prediction time: {:4f} seconds'.format(prediction_time))
    
    [accuracy, balanced_accuracy, auc] = compute_metrics(feature_test_PCA,label_test_PCA,test_preds,svm_PCA)
    print('\nAccuracy: {:4f}'.format(accuracy))
    print('Balanced Accuracy: {:4f}'.format(balanced_accuracy))
    print('AUC: {:4f}'.format(auc))
    
    row = pd.Series({'Feature Extraction Train Time':tm_feature_train_PCA,
        'Feature Extraction Test Time':tm_feature_test_PCA,
        'Train Time':train_time,
        'Prediction Time':prediction_time,
        'Accuracy':accuracy,
        'AUC':auc,
        'Balanced Accuracy':balanced_accuracy},name='SVM with PCA')
    model_results_df = model_results_df.append(row)

Feature extraction time for train: 6.889987 seconds
Feature extraction time for test:  0.119862  seconds

Training time: 0.835471 seconds
Prediction time: 0.020878 seconds

Accuracy: 0.786667
Balanced Accuracy: 0.585341
AUC: 0.745118


### Weighted SVM

In [42]:
# weights = {0:80.0, 1:20.0}
# params= {'C': [1,10,15,20],
#        'kernel':['linear', 'rbf', 'poly'],
#        'degree':[2,3,4]}
# cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=3, random_state=1)
# gscv = GridSearchCV(SVC(class_weight=weights,random_state = 2020,probability=True), params, cv=3, scoring='roc_auc',verbose=True)
# gscv.fit(feature_train,label_train)
# gscv.best_params_
# #output: output: {'C': 10, 'degree': 4, 'kernel': 'poly'}

In [43]:
from ipynb.fs.full.train_svm import train_svm

#improved svm using parameters from grid search
if run_svm==True:  
    
    print('Feature extraction time for train: {:4f} seconds'.format(tm_feature_train_improved))
    print('Feature extraction time for test: {:4f}  seconds'.format(tm_feature_test_improved))
    
    weights = {0:80.0, 1:20.0}
    [train_time,svm_model] = train_svm(feature_train,label_train,
                                      C=10, kernel='poly', degree=4,
                                      probability=True,class_weight=weights)
    
    print('\nTraining time: {:4f} seconds'.format(train_time))
    
    [prediction_time,test_preds] = test_model(svm_model,feature_test)
    print('Prediction time: {:4f} seconds'.format(prediction_time))
    
    
    [accuracy, balanced_accuracy, auc] = compute_metrics(feature_test,label_test,test_preds,svm_model)
    print('\nAccuracy: {:4f}'.format(accuracy))
    print('Balanced Accuracy: {:4f}'.format(balanced_accuracy))
    print('AUC: {:4f}'.format(auc))
    
    row = pd.Series({'Feature Extraction Train Time':tm_feature_train_improved,
        'Feature Extraction Test Time':tm_feature_test_improved,
        'Train Time':train_time,
        'Prediction Time':prediction_time,
        'Accuracy':accuracy,
        'AUC':auc,
        'Balanced Accuracy':balanced_accuracy},name='Weighted SVM')
    model_results_df = model_results_df.append(row)

Feature extraction time for train: 0.193292 seconds
Feature extraction time for test: 0.038007  seconds

Training time: 403.409784 seconds
Prediction time: 1.722822 seconds

Accuracy: 0.836667
Balanced Accuracy: 0.717851
AUC: 0.837193


### Naive Bayes

In [44]:
from ipynb.fs.full.train_naive_bayes import train_naive_bayes

if run_naivebayes == True:
    
    print('Feature extraction time for train: {:4f} seconds'.format(tm_feature_train_improved))
    print('Feature extraction time for test: {:4f}  seconds'.format(tm_feature_test_improved))
    
    [train_time,gnb] = train_naive_bayes(feature_train,label_train)
    print('\nTraining time: {:4f} seconds'.format(train_time))
    
    [prediction_time,test_preds] = test_model(gnb,feature_test)
    print('Prediction time: {:4f} seconds'.format(prediction_time))
    
    [accuracy, balanced_accuracy, auc] = compute_metrics(feature_test,label_test,test_preds,gnb)
    print('\nAccuracy: {:4f}'.format(accuracy))
    print('Balanced Accuracy: {:4f}'.format(balanced_accuracy))
    print('AUC: {:4f}'.format(auc))
    
    row = pd.Series({'Feature Extraction Train Time':tm_feature_train_improved,
        'Feature Extraction Test Time':tm_feature_test_improved,
        'Train Time':train_time,
        'Prediction Time':prediction_time,
        'Accuracy':accuracy,
        'AUC':auc,
        'Balanced Accuracy':balanced_accuracy},name='Naive Bayes')
    model_results_df = model_results_df.append(row)

Feature extraction time for train: 0.193292 seconds
Feature extraction time for test: 0.038007  seconds

Training time: 0.127418 seconds
Prediction time: 0.039222 seconds

Accuracy: 0.663333
Balanced Accuracy: 0.628073
AUC: 0.660369


### Lasso

In [45]:
from ipynb.fs.full.train_lasso import train_lasso

if run_lasso==True:  
    
    print('Feature extraction time for train: {:4f} seconds'.format(tm_feature_train_improved))
    print('Feature extraction time for test: {:4f}  seconds'.format(tm_feature_test_improved))
    
    weights = {0:1.0, 1:1.0}
    [train_time, lasso_model] = train_lasso(feature_train,label_train,
                                           penalty='l1',solver='liblinear',
                                           class_weight=weights)

    print('\nTraining time: {:4f} seconds'.format(train_time))
    
    [prediction_time,test_preds] = test_model(lasso_model,feature_test)
    print('Prediction time: {:4f} seconds'.format(prediction_time))
    
    [accuracy, balanced_accuracy, auc] = compute_metrics(feature_test,label_test,test_preds,lasso_model)
    print('\nAccuracy: {:4f}'.format(accuracy))
    print('Balanced Accuracy: {:4f}'.format(balanced_accuracy))
    print('AUC: {:4f}'.format(auc))

    row = pd.Series({'Feature Extraction Train Time':tm_feature_train_improved,
        'Feature Extraction Test Time':tm_feature_test_improved,
        'Train Time':train_time,
        'Prediction Time':prediction_time,
        'Accuracy':accuracy,
        'AUC':auc,
        'Balanced Accuracy':balanced_accuracy},name='Lasso')
    model_results_df = model_results_df.append(row)

Feature extraction time for train: 0.193292 seconds
Feature extraction time for test: 0.038007  seconds

Training time: 212.432784 seconds
Prediction time: 0.017244 seconds

Accuracy: 0.825000
Balanced Accuracy: 0.693171
AUC: 0.826389


### Weighted Lasso

In [46]:
# weights = {0:80.0, 1:20.0}
# param= {'solver':['liblinear', 'saga']}
# cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=3, random_state=1)
# gscv = GridSearchCV(LogisticRegression(penalty='l1', class_weight=weights), params, cv=3, scoring='roc_auc',verbose=True)
# gscv.fit(feature_train,label_train)
# gscv.best_params_

In [47]:
if run_weighted_lasso==True: 
    
    print('Feature extraction time for train: {:4f} seconds'.format(tm_feature_train_improved))
    print('Feature extraction time for test: {:4f}  seconds'.format(tm_feature_test_improved))
    
    weights = {0:80.0, 1:20.0}
    [train_time, lasso_model] = train_lasso(feature_train,label_train,
                                           penalty='l1',solver='liblinear',
                                           class_weight=weights)

    print('\nTraining time: {:4f} seconds'.format(train_time))
    
    [prediction_time,test_preds] = test_model(lasso_model,feature_test)
    print('Prediction time: {:4f} seconds'.format(prediction_time))
    
    [accuracy, balanced_accuracy, auc] = compute_metrics(feature_test,label_test,test_preds,lasso_model)
    print('\nAccuracy: {:4f}'.format(accuracy))
    print('Balanced Accuracy: {:4f}'.format(balanced_accuracy))
    print('AUC: {:4f}'.format(auc))
    
    row = pd.Series({'Feature Extraction Train Time':tm_feature_train_improved,
        'Feature Extraction Test Time':tm_feature_test_improved,
        'Train Time':train_time,
        'Prediction Time':prediction_time,
        'Accuracy':accuracy,
        'AUC':auc,
        'Balanced Accuracy':balanced_accuracy},name='Weighted Lasso')
    model_results_df = model_results_df.append(row)

Feature extraction time for train: 0.193292 seconds
Feature extraction time for test: 0.038007  seconds

Training time: 61.974368 seconds
Prediction time: 0.017205 seconds

Accuracy: 0.831667
Balanced Accuracy: 0.622522
AUC: 0.819164


### SMOTE Bagging

In [48]:
#params = {'n_estimators':[25,50,75,100]}
#cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=3, random_state=1)
#gscv = GridSearchCV(BaggingClassifier(),params,cv=cv,scoring='roc_auc').fit(feature_train_sm,label_train_sm)
#gscv.best_params_
#output: {{'n_estimators': 100}}

In [49]:
from ipynb.fs.full.train_bagging import train_bagging

if run_bagging_smote == True:
    
    print('Feature extraction time for train: {:4f} seconds'.format(tm_feature_train_SMOTE))
    print('Feature extraction time for test: {:4f}  seconds'.format(tm_feature_test_improved))
    
    [train_time, smote_bagging] = train_bagging(feature_train_sm,label_train_sm,n_estimators=100) 
    print('\nTraining time: {:4f} seconds'.format(train_time))
    
    [prediction_time,test_preds] = test_model(smote_bagging,feature_test)
    print('Prediction time: {:4f} seconds'.format(prediction_time))

    [accuracy, balanced_accuracy, auc] = compute_metrics(feature_test,label_test,test_preds,smote_bagging)
    balanced_accuracy = balanced_accuracy_score(label_test,test_preds)
    print('\nAccuracy: {:4f}'.format(accuracy))
    print('Balanced Accuracy: {:4f}'.format(balanced_accuracy))
    print('AUC: {:4f}'.format(auc))
    
    row = pd.Series({'Feature Extraction Train Time':tm_feature_train_SMOTE,
        'Feature Extraction Test Time':tm_feature_test_improved,
        'Train Time':train_time,
        'Prediction Time':prediction_time,
        'Accuracy':accuracy,
        'AUC':auc,
        'Balanced Accuracy':balanced_accuracy},name='SMOTE Bagging')
    model_results_df = model_results_df.append(row)

Feature extraction time for train: 2.512055 seconds
Feature extraction time for test: 0.038007  seconds

Training time: 668.403124 seconds
Prediction time: 0.653561 seconds

Accuracy: 0.806667
Balanced Accuracy: 0.632585
AUC: 0.794535


### 10-fold Cross Validation 

In [51]:
# weighted lasso
#print("Cross Validation Score: ", np.mean(cross_val_score(lasso_w, feature_train, label_train, cv=10, scoring='roc_auc')))
# output:0.8445833333333332
# roc_auc output:0.8010359440647241

In [52]:
# weighted svm
#print("Cross Validation Score: ", np.mean(cross_val_score(svm, feature_train, label_train, cv=10, scoring='roc_auc')))
# output:0.8195833333333333
# roc_auc output:0.8117648633580459

In [53]:
# weighted Logistic
#print("Cross Validation Score: ", np.mean(cross_val_score(lr, feature_train, label_train, cv=10, scoring='roc_auc')))
# output:0.8300000000000001

In [54]:
# XGBoost with SMOTE
#print("Cross Validation Score: ", np.mean(cross_val_score(xgb_sm, feature_train, label_train, cv=10, scoring='roc_auc')))
# output:0.8300000000000001
# roc_auc output:0.8387998261113715

### Table

In [55]:
model_results_df

Unnamed: 0,Feature Extraction Train Time,Feature Extraction Test Time,Train Time,Prediction Time,Accuracy,Balanced Accuracy,AUC
Baseline,5.376008,1.175714,186.140792,0.055967,0.808333,0.587563,0.785287
Advanced (SMOTEBoost),2.512055,0.038007,206.047342,0.077489,0.821667,0.670898,0.821778
Baseline with Improved Features,0.193292,0.038007,192.445398,0.032289,0.816667,0.610128,0.785287
Baseline with PCA,6.889987,0.119862,1.294838,0.002415,0.79,0.535616,0.692814
KNN,0.193292,0.038007,0.381865,6.070336,0.791667,0.516514,0.674527
SMOTE KNN,2.512055,0.038007,0.762404,8.466854,0.606667,0.638211,0.683691
XGBoost,0.193292,0.038007,89.336492,0.087405,0.82,0.620882,0.809126
Random Forest,0.193292,0.038007,7.629287,0.03991,0.81,0.562701,0.766485
LDA,0.193292,0.038007,7.088265,0.018853,0.821667,0.636339,0.786203
Logistic Regression,0.193292,0.038007,119.274133,0.024061,0.821667,0.699697,0.82231


In [56]:
model_results_df = model_results_df.applymap(lambda x: round(x,3))
model_results_df['Train Time'] = [str(x)+' s' for x in list(model_results_df['Train Time'])]
model_results_df['Prediction Time'] = [str(x)+' s' for x in list(model_results_df['Prediction Time'])]
model_results_df['Feature Extraction Train Time'] = [str(x)+' s' for x in list(model_results_df['Feature Extraction Train Time'])]
model_results_df['Feature Extraction Test Time'] = [str(x)+' s' for x in list(model_results_df['Feature Extraction Test Time'])]

In [57]:
model_results_df

Unnamed: 0,Feature Extraction Train Time,Feature Extraction Test Time,Train Time,Prediction Time,Accuracy,Balanced Accuracy,AUC
Baseline,5.376 s,1.176 s,186.141 s,0.056 s,0.808,0.588,0.785
Advanced (SMOTEBoost),2.512 s,0.038 s,206.047 s,0.077 s,0.822,0.671,0.822
Baseline with Improved Features,0.193 s,0.038 s,192.445 s,0.032 s,0.817,0.61,0.785
Baseline with PCA,6.89 s,0.12 s,1.295 s,0.002 s,0.79,0.536,0.693
KNN,0.193 s,0.038 s,0.382 s,6.07 s,0.792,0.517,0.675
SMOTE KNN,2.512 s,0.038 s,0.762 s,8.467 s,0.607,0.638,0.684
XGBoost,0.193 s,0.038 s,89.336 s,0.087 s,0.82,0.621,0.809
Random Forest,0.193 s,0.038 s,7.629 s,0.04 s,0.81,0.563,0.766
LDA,0.193 s,0.038 s,7.088 s,0.019 s,0.822,0.636,0.786
Logistic Regression,0.193 s,0.038 s,119.274 s,0.024 s,0.822,0.7,0.822


### References