# Model Building - Logistic Regression

This notebook implements the model-building step of my thesis, which compares supervised machine learning techniques (logistic regression, random forests, support vector machines and neural networks). The main question is when each method performs better, as measured by discrimination (AUC) and calibration (calibration plots). The project considers several factors, including the number of features, the number of datapoints and 'predictor strength'. This notebook fits logistic regression to a small number of variables and writes the predictions and observed outcomes to a pandas dataframe; model evaluation is carried out in a following notebook.
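
To make the two performance measures concrete: discrimination asks whether higher predicted probabilities go to the cases that had the event, while calibration asks whether predicted probabilities match observed event rates. The following is a minimal sketch of the binning that sits behind a calibration plot, using made-up probabilities and outcomes. The function name `calibration_bins` and the choice of equal-width bins are assumptions for illustration, not necessarily the method used in the evaluation notebook.

```python
def calibration_bins(probs, outcomes, n_bins=5):
    """Group predictions into equal-width bins and compare the mean
    predicted probability with the observed event rate in each bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        # Bin index for p in [0, 1]; p == 1.0 is placed in the last bin
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    summary = []
    for members in bins:
        if not members:
            continue  # an empty bin has no observed rate to report
        mean_pred = sum(p for p, _ in members) / len(members)
        obs_rate = sum(y for _, y in members) / len(members)
        summary.append((mean_pred, obs_rate, len(members)))
    return summary

# Illustrative (made-up) predictions and outcomes, not thesis data
probs = [0.05, 0.10, 0.15, 0.30, 0.35, 0.62, 0.68, 0.90, 0.95, 0.85]
outcomes = [0, 0, 0, 0, 1, 1, 0, 1, 1, 1]

for mean_pred, obs_rate, n in calibration_bins(probs, outcomes):
    print(f"mean predicted = {mean_pred:.2f}, observed rate = {obs_rate:.2f}, n = {n}")
```

Plotting the observed rate against the mean predicted probability for each bin gives the calibration plot; points close to the diagonal indicate good calibration.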

In [1]:
import glob
import numpy as np
import os
import pandas as pd
import sklearn

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve, auc
from sklearn.model_selection import StratifiedKFold, train_test_split

np.random.seed(21)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

## Logistic regression

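Non-penalised logistic regression has no hyperparameters to tune, so this is just a case of fitting the model. Note that scikit-learn's `LogisticRegression` applies an L2 penalty by default (`C = 1.0`), so a strictly unpenalised fit needs the penalty switched off explicitly, for example with `penalty = None` in recent versions or with a very large `C`. At this stage linearity and additivity are assumed, though these assumptions may be relaxed later.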
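
As a rough illustration of how the penalty setting affects the fit, the sketch below compares scikit-learn's default `LogisticRegression` (L2 penalty, `C = 1.0`) with a fit using a very large `C`, which makes the penalty negligible. The synthetic data from `make_classification` and the value `C = 1e10` are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, purely for illustration (not the thesis data)
X, y = make_classification(n_samples=200, n_features=5, n_informative=3,
                           n_redundant=0, random_state=21)

# Default settings: L2 penalty with C = 1.0
lr_default = LogisticRegression(max_iter=10000, solver='lbfgs').fit(X, y)

# A very large C makes the penalty negligible, approximating an unpenalised fit
lr_weak = LogisticRegression(C=1e10, max_iter=10000, solver='lbfgs').fit(X, y)

# The L2 penalty shrinks the coefficients towards zero
norm_default = np.linalg.norm(lr_default.coef_)
norm_weak = np.linalg.norm(lr_weak.coef_)
print(f"coefficient norm, default C = 1.0: {norm_default:.3f}")
print(f"coefficient norm, C = 1e10:        {norm_weak:.3f}")
```

With large samples and few features the two fits tend to be close, so the choice matters most for the smaller or higher-dimensional scenarios.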

In [2]:
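def logistic(X_train, X_test, y_train, y_test):
    """Fit logistic regression; return test-set predictions with outcomes, and the AUC."""
    # Note: scikit-learn applies an L2 penalty (C = 1.0) unless told otherwise
    lr = LogisticRegression(max_iter = 10000, solver = 'lbfgs')
    lr.fit(X_train, y_train)
    # Predicted probability of the second class in lr.classes_ (the event, for 0/1 labels)
    y_pred = lr.predict_proba(X_test)[:, 1]
    # Discrimination: area under the ROC curve on the test set
    fpr, tpr, threshold = roc_curve(y_test, y_pred)
    auc_lr = auc(fpr, tpr)
    # Align the predictions with the observed outcomes via the test-set index
    y_test = pd.DataFrame(y_test)
    pred_df = pd.DataFrame(y_pred)
    pred_df['index'] = y_test.index
    pred_df.set_index('index', inplace = True)
    pred_df.rename(columns = {0: "prob"}, inplace = True)
    return pred_df.join(y_test, how = 'left').reset_index(drop = True), auc_lr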
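
The AUC computed in `logistic` has a direct interpretation: it is the probability that a randomly chosen positive case receives a higher predicted probability than a randomly chosen negative case, with ties counted as one half. The sketch below computes this pairwise version in plain Python on made-up data. `pairwise_auc` is a hypothetical helper name, and the quadratic loop is for illustration only rather than a replacement for `roc_curve` and `auc`.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive case ranked above negative case
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

# Made-up labels and scores: 3 of the 4 positive/negative pairs are ordered correctly
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_true, y_score))  # 0.75
```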

In [3]:
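# Work from the folder holding the bootstrap samples
current_wd = os.getcwd()
base_wd = current_wd.replace("\\", "/")
path = base_wd + "/data/bootstraps"
os.chdir(path)
# Sort so the files are processed in the same order on every run (glob order is arbitrary)
all_filenames = sorted(glob.glob('*.csv'))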
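
Because `os.chdir` changes the working directory for the rest of the session, every later relative path (including the output path for the predictions) is resolved against the bootstrap folder. A possible alternative, sketched here under the assumption that the same `data/bootstraps` layout is used, is to build paths with `pathlib` and leave the working directory alone. The helper name `list_bootstrap_files` is hypothetical, and the demonstration uses a temporary folder with empty placeholder files rather than the real data.

```python
import tempfile
from pathlib import Path

def list_bootstrap_files(base_dir):
    """Return the CSV files under <base_dir>/data/bootstraps, sorted by name."""
    bootstrap_dir = Path(base_dir) / "data" / "bootstraps"
    return sorted(bootstrap_dir.glob("*.csv"))

# Demonstration on a temporary folder with placeholder files
with tempfile.TemporaryDirectory() as tmp:
    bootstrap_dir = Path(tmp) / "data" / "bootstraps"
    bootstrap_dir.mkdir(parents=True)
    for name in ["boot_2.csv", "boot_1.csv", "notes.txt"]:
        (bootstrap_dir / name).touch()
    # Only the CSV files are returned, in sorted order
    found = [p.name for p in list_bootstrap_files(tmp)]
    print(found)  # ['boot_1.csv', 'boot_2.csv']
```

The returned `Path` objects can be passed straight to `pd.read_csv`, so the loop over files would not need to change otherwise.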

In [4]:
%%time
# Fit logistic regression to each bootstrap sample
np.random.seed(21)
dataframes = []
for i, filename in enumerate(all_filenames, start = 1):
    print(str(i) + " of " + str(len(all_filenames)))
    mf = pd.read_csv(filename)
    # 80/20 split, stratified on the outcome so both sets keep the same event rate
    X_train, X_test, y_train, y_test = train_test_split(mf.drop('deceased', axis = 1),
                                                        mf['deceased'], test_size = 0.2,
                                                        random_state = 21,
                                                        stratify = mf['deceased'])
    # logistic() already returns predictions with a fresh 0..n-1 index
    preds, lr_auc = logistic(X_train, X_test, y_train, y_test)
    dataframes.append(preds)
    print(lr_auc)
# One pair of columns (prob, deceased) per bootstrap, placed side by side
lr_predictions = pd.concat(dataframes, axis = 1)
# Create the output folder if it does not exist yet
os.makedirs('./predictions', exist_ok = True)
lr_predictions.to_csv('./predictions/lr_predictions.csv', index = False)

1 of 999
0.8773371510379384
2 of 999
0.8535745047372955
3 of 999
0.8799918178352595
4 of 999
0.8747722018223854
5 of 999
0.8855999696601942
6 of 999
0.885579864436407
7 of 999
0.8612897265336289
8 of 999
0.8564642748786406
9 of 999
0.8754750175932442
10 of 999
0.8652438561893203
11 of 999
0.8790464461646726
12 of 999
0.8539643682850537
13 of 999
0.8534431316386204
14 of 999
0.8628598331988162
15 of 999
0.8944618327247436
16 of 999
0.8833887657058389
17 of 999
0.8688938852377109
18 of 999
0.8702025182038835
19 of 999
0.8849627820765542
20 of 999
0.868942381794279
21 of 999
0.8590482793854687
22 of 999
0.8615480056928073
23 of 999
0.8536839032222748
24 of 999
0.8537526547330097
25 of 999
0.8881830305651672
26 of 999
0.8881929046563193
27 of 999
0.8861793212580179
28 of 999
0.8797671188975537
29 of 999
0.8723376380310688
30 of 999
0.9165791113053342
31 of 999
0.8735953733269284
32 of 999
0.8763935249323577
33 of 999
0.8608618079584774
34 of 999
0.8675383040048543
35 of 999
0.8829822616407
...
0.8579637841832963
281 of 999
0.8603806295776842
282 of 999
0.8923300431568603
283 of 999
0.8788356000954427
284 of 999
0.8630128292306996
285 of 999
0.8658339197748064
286 of 999
0.8477605430825242
287 of 999
0.8943661971830986
288 of 999
0.8721345488262782
289 of 999
0.8775032753134944
290 of 999
0.8659249010338599
291 of 999
0.8640697087658593
292 of 999
0.8532209075660727
293 of 999
0.8838784183296379
294 of 999
0.8569133456351501
295 of 999
0.8558851340806228
296 of 999
0.8797593699733149
297 of 999
0.8670524834744749
298 of 999
0.8493394832295912
299 of 999
0.8924901334689688
300 of 999
0.8569019447844417
301 of 999
0.8632722694595483
302 of 999
0.8906344811893205
303 of 999
0.8939578713968958
304 of 999
0.8813648293963254
305 of 999
0.8512658507880042
306 of 999
0.8784606784547941
307 of 999
0.8557932314184257
308 of 999
0.8829113629812834
309 of 999
0.8630882108595465
310 of 999
0.8817983481377591
311 of 999
0.8899102744546095
312 of 999
0.8767453316018091
313 of 999
0.85859283
...
0.8862636966551326
557 of 999
0.8512471452241279
558 of 999
0.8585365853658538
559 of 999
0.8948564025836587
560 of 999
0.8982169253510717
561 of 999
0.8658642944357229
562 of 999
0.8582625917006815
563 of 999
0.9068920699355482
564 of 999
0.8871109002663942
565 of 999
0.885522368008598
566 of 999
0.85920846071222
567 of 999
0.855932285368803
568 of 999
0.8767759191259803
569 of 999
0.9052293544502427
570 of 999
0.8440004897943806
571 of 999
0.8774562531154079
572 of 999
0.8836012564671101
573 of 999
0.8908384802545386
574 of 999
0.8745398294270474
575 of 999
0.8572209036765035
576 of 999
0.8715194579856234
577 of 999
0.8746548149250123
578 of 999
0.8710710484425481
579 of 999
0.8690783807062877
580 of 999
0.8415742793791574
581 of 999
0.8970248027912621
582 of 999
0.8794333910034602
583 of 999
0.8980959198282032
584 of 999
0.8586576330937232
585 of 999
0.8858425132598939
586 of 999
0.8781690140845071
587 of 999
0.8754266535194175
588 of 999
0.8423230742479569
589 of 999
0.889639639639
...
0.8592503022974607
832 of 999
0.8741130820399113
833 of 999
0.8675882455225179
834 of 999
0.897295236678413
835 of 999
0.8474318212350544
836 of 999
0.892513569155905
837 of 999
0.8747060831310678
838 of 999
0.8722405153901218
839 of 999
0.8442453869846561
840 of 999
0.8823518745183372
841 of 999
0.8377010012135923
842 of 999
0.8973187196601942
843 of 999
0.859062212345771
844 of 999
0.8609124973268496
845 of 999
0.8939661829858148
846 of 999
0.8899213027894655
847 of 999
0.8631970979443774
848 of 999
0.8784598859786831
849 of 999
0.864968901699029
850 of 999
0.854615857642492
851 of 999
0.8682557280118255
852 of 999
0.8710379261931549
853 of 999
0.8518291257597008
854 of 999
0.8934240776557312
855 of 999
0.8862540967804127
856 of 999
0.8795742434904997
857 of 999
0.8907335550628234
858 of 999
0.8660384331116037
859 of 999
0.8750432376340367
860 of 999
0.890604967948718
861 of 999
0.8802077484559236
862 of 999
0.8849963045084996
863 of 999
0.878992274876052
864 of 999
0.862164568931486