# Modeling

# Neural Network Classifier 

We start by scaling the features and handling the class imbalance in the target.

We then fit a basic neural network classifier to surface any obvious issues and to record what a "first try" model looks like as a baseline beyond our null model.

After that "first try" model, we fix any obvious issues and then iterate through grid searches to tune hyperparameters, starting from general knowledge of neural networks and narrowing the ranges based on earlier searches.

## Imports

In [16]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn import metrics
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, GRU
from tensorflow.keras.layers import Dropout
from tensorflow.keras.callbacks import EarlyStopping
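# Note: this wrapper module has been removed from newer TensorFlow releases (SciKeras is the suggested replacement)
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier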

from imblearn.under_sampling import NearMiss, RandomUnderSampler
from imblearn.over_sampling import SMOTE, RandomOverSampler
from collections import Counter

## Read in Data

Load the data and take a first look at its shape and first rows.

In [3]:
csv_file = "../drugs_2020_simply_imputed.csv"
df = pd.read_csv(csv_file)
print(df.shape)
df.head()

(16829, 67)


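| | Unnamed: 0 | Unnamed: 0.1 | ACCGDLN | AGE | ALTDUM | AMENDYR | AMTTOTAL | CASETYPE | CITWHERE | COMBDRG2 | ... | TYPEMONY | TYPEOTHS | UNIT1 | MWGT1 | WGT1 | XCRHISSR | XFOLSOR | XMAXSOR | XMINSOR | SENTRNGE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 6 | 1.0 | 20.0 | 0 | 2018.0 | 0 | 1.0 | 211.0 | 6.0 | ... | 1.0 | 0 | 1.0 | 63560990.0 | 85104.433315 | 1.0 | 17.0 | 30.0 | 24.0 | 8.0 |
| 1 | 1 | 14 | 1.0 | 64.0 | 0 | 2018.0 | 0 | 1.0 | 211.0 | 1.0 | ... | 1.0 | 0 | 1.0 | 1193400.0 | 5967.0 | 3.0 | 27.0 | 108.0 | 87.0 | 0.0 |
| 2 | 2 | 15 | 1.0 | 28.0 | 0 | 2018.0 | 0 | 1.0 | 211.0 | 3.0 | ... | 1.0 | 0 | 2.0 | 2000000.0 | 2000.0 | 6.0 | 27.0 | 162.0 | 130.0 | 2.0 |
| 3 | 3 | 26 | 2.0 | 55.0 | 0 | 2018.0 | 0 | 1.0 | 211.0 | 77.0 | ... | 1.0 | 0 | 1.0 | 10300.0 | 4.12 | 5.0 | 13.0 | 37.0 | 30.0 | 0.0 |
| 4 | 4 | 29 | 1.0 | 30.0 | 0 | 2018.0 | 0 | 1.0 | 211.0 | 6.0 | ... | 1.0 | 0 | 1.0 | 169200.0 | 84.6 | 6.0 | 25.0 | 137.0 | 110.0 | 2.0 |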


> **16,829 rows and 67 columns**
>> **However, some of these columns will be dropped, and one is our target column, PRISDUM.**

In [7]:
df.columns

Index(['ACCGDLN', 'AGE', 'ALTDUM', 'AMTTOTAL', 'CASETYPE', 'CITWHERE',
       'COMBDRG2', 'CRIMHIST', 'DISPOSIT', 'DISTRICT', 'DRUGMIN', 'DSPLEA',
       'EDUCATN', 'INTDUM', 'METHMIN', 'MONRACE', 'MONSEX', 'MWEIGHT',
       'NEWCIT', 'NEWCNVTN', 'NEWEDUC', 'NEWRACE', 'NODRUG', 'NUMDEPEN',
       'OFFGUIDE', 'PRISDUM', 'PROBATN', 'PROBDUM', 'QUARTER', 'REAS1',
       'REAS2', 'REAS3', 'REGSXMIN', 'RELMIN', 'RESTDET1', 'RESTDUM', 'SAFE',
       'SAFETY', 'SENSPCAP', 'SENSPLT0', 'SENTIMP', 'SMAX1', 'SMIN1',
       'SOURCES', 'STATMAX', 'STATMIN', 'SUPERMAX', 'SUPERMIN', 'SUPREL',
       'TIMSERVC', 'TOTCHPTS', 'TOTREST', 'TOTUNIT', 'TYPEMONY', 'TYPEOTHS',
       'UNIT1', 'MWGT1', 'WGT1', 'XCRHISSR', 'XFOLSOR', 'XMAXSOR', 'XMINSOR',
       'SENTRNGE'],
      dtype='object')

- Drop the index columns created from saving a DataFrame to a csv.
- Also drop the columns we have identified as either too correlated or not useful for our model.

In [6]:
df.drop(columns=['Unnamed: 0', 'Unnamed: 0.1', 'AMENDYR', 'SUPRDUM'], inplace=True)

## Train Test Split

Set our features `X` and our target `y`.

In [8]:
X = df.drop(columns='PRISDUM')
y = df['PRISDUM']

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)
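
Passing `stratify=y` keeps the class proportions of the target the same in the train and test sets. The sketch below illustrates that idea in plain Python on toy labels. The function name `stratified_split` and the toy data are our own assumptions, and scikit-learn's real implementation differs in its details; the 25% test share mirrors the `train_test_split` default.

```python
import random
from collections import Counter


def stratified_split(labels, test_size=0.25, seed=0):
    """Split indices so each class keeps the same share in train and test."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)                      # shuffle within the class
        n_test = round(len(idxs) * test_size)  # same fraction from every class
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels (assumed values, not the real data): 96% class 1, 4% class 0
labels = [1] * 960 + [0] * 40
train_idx, test_idx = stratified_split(labels)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(train_counts, test_counts)
```

Both splits end up with the same 4% share of class 0, which is the property we rely on when we later compare validation accuracy against the null model.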

## Scale Data for Neural Network Classifier

In [10]:
sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.transform(X_test)
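
The scaler is fit on the training set only and then reused on the test set, so no statistics from the test rows influence the preprocessing. Here is a minimal standard-library sketch of that idea. The helpers `fit_scaler` and `transform` and the toy values are illustrative assumptions, not part of the notebook.

```python
from statistics import fmean, pstdev


def fit_scaler(column):
    """Learn the mean and (population) standard deviation from training values."""
    mean = fmean(column)
    std = pstdev(column) or 1.0  # fall back to 1.0 for a constant column
    return mean, std


def transform(column, mean, std):
    """Standardize values using statistics learned elsewhere."""
    return [(value - mean) / std for value in column]


# Toy values (assumed), standing in for a single numeric feature
train_values = [20.0, 64.0, 28.0, 55.0]
test_values = [30.0]

mean, std = fit_scaler(train_values)                 # fit on train only
train_scaled = transform(train_values, mean, std)
test_scaled = transform(test_values, mean, std)      # reuse the train statistics

print(round(fmean(train_scaled), 6), round(pstdev(train_scaled), 6))
```

The scaled training values have mean 0 and standard deviation 1 by construction; the test values generally will not, and that is expected.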

### Null Model

In [12]:
y.value_counts(normalize=True)

1    0.955196
0    0.044804
Name: PRISDUM, dtype: float64

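> We see that we have a very imbalanced dataset: about 95.5% of cases are class 1, so a null model that always predicts 1 is already about 95.5% accurate.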
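
To make the null model explicit, the sketch below computes the majority-class baseline on toy labels with roughly the same skew as `PRISDUM`. The toy list `y_toy` is an assumption for illustration, not the real target column.

```python
from collections import Counter

# Toy labels (assumed): 95.5% class 1, 4.5% class 0
y_toy = [1] * 955 + [0] * 45

counts = Counter(y_toy)
majority_class, majority_count = counts.most_common(1)[0]

# Accuracy of a model that always predicts the majority class
baseline_accuracy = majority_count / len(y_toy)
print(majority_class, baseline_accuracy)
```

Any model we train has to beat this number before its accuracy means anything.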

### Model on Imbalanced Data

In [21]:
model = Sequential()
model.add(Dense(17, input_shape=(X.shape[1],), activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(20, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(30, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(40, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(50, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(30, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(20, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='bce', 
              optimizer='adam', 
              metrics=['accuracy'])

early_stop = EarlyStopping(monitor='val_loss', patience=10, verbose=1)

history = model.fit(
    X_train_sc,
    y_train,
    validation_data=(X_test_sc, y_test),
    epochs=200,
    callbacks=[early_stop]
)

Epoch 1/200
Epoch 2/200
Epoch 3/200
Epoch 4/200
Epoch 5/200
Epoch 6/200
Epoch 7/200
Epoch 8/200
Epoch 9/200
Epoch 10/200
Epoch 11/200
Epoch 12/200
Epoch 13/200
Epoch 14/200
Epoch 15/200
Epoch 00015: early stopping


#### Analysis:
Results: loss: 0.0073 - accuracy: 0.9976 - val_loss: 0.0070 - val_accuracy: 0.9988

As expected, the model is very accurate. However, the null model already reaches about 95.5% accuracy by always predicting the majority class, so this is not a massive improvement, and the score is likely distorted by the class imbalance.
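
With a target this skewed, accuracy hides how the model treats the minority class. The sketch below uses toy labels and a model that always predicts class 1 to show how per-class recall and balanced accuracy expose the problem. The helper `confusion_counts` and the data are assumptions for illustration; scikit-learn's `metrics.confusion_matrix` and `metrics.balanced_accuracy_score` cover the same ground on real predictions.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels, treating 1 as positive."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1:
            fn += 1
        elif pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


# Toy data (assumed): 95 cases of class 1, 5 of class 0
y_true = [1] * 95 + [0] * 5
y_pred = [1] * 100  # a model that always predicts the majority class

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)                   # 0.95
recall_class_1 = tp / (tp + fn)                      # 1.0
recall_class_0 = tn / (tn + fp)                      # 0.0
balanced_accuracy = (recall_class_1 + recall_class_0) / 2  # 0.5

print(accuracy, recall_class_0, balanced_accuracy)
```

A 95% accurate model that never identifies class 0 scores only 0.5 on balanced accuracy, which is why we look beyond plain accuracy below.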

---

## Balance Imbalanced Data

### Under Sample Majority

In [17]:
# Randomly drop majority-class rows until both classes are the same size
rus = RandomUnderSampler()
X_train_under, y_train_under = rus.fit_resample(X_train_sc, y_train)

In [19]:
y_train_under.value_counts(normalize=True)

0    0.5
1    0.5
Name: PRISDUM, dtype: float64
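
As a rough picture of what random under-sampling does, the sketch below keeps every minority row and a random subset of the majority rows of equal size. It is a simplified stand-in written for this explanation, not imbalanced-learn's implementation; `random_under_sample` and the toy data are assumptions.

```python
import random
from collections import Counter


def random_under_sample(rows, labels, seed=0):
    """Sample every class down to the size of the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    n_min = min(len(idxs) for idxs in by_class.values())
    keep = []
    for idxs in by_class.values():
        keep.extend(rng.sample(idxs, n_min))  # sampling without replacement
    keep.sort()
    return [rows[i] for i in keep], [labels[i] for i in keep]


# Toy data (assumed): 95 majority rows, 5 minority rows
rows = [[float(i)] for i in range(100)]
labels = [1] * 95 + [0] * 5

rows_under, labels_under = random_under_sample(rows, labels)
print(Counter(labels_under))
```

The price of the 50/50 balance is that most majority rows are thrown away, so the model trains on far less data.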

In [18]:
model = Sequential()
model.add(Dense(17, input_shape=(X.shape[1],), activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(20, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(30, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(40, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(50, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(30, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(20, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='bce', 
              optimizer='adam', 
              metrics=['accuracy'])

early_stop = EarlyStopping(monitor='val_loss', patience=10, verbose=1)

history = model.fit(
    X_train_under,
    y_train_under,
    validation_data=(X_test_sc, y_test),
    epochs=200,
    callbacks=[early_stop]
)

Epoch 1/200
Epoch 2/200
Epoch 3/200
Epoch 4/200
Epoch 5/200
Epoch 6/200
Epoch 7/200
Epoch 8/200
Epoch 9/200
Epoch 10/200
Epoch 11/200
Epoch 12/200
Epoch 13/200
Epoch 14/200
Epoch 15/200
Epoch 16/200
Epoch 17/200
Epoch 18/200
Epoch 19/200
Epoch 00019: early stopping


#### Analysis 

Results: loss: 0.0301 - accuracy: 0.9876 - val_loss: 0.1276 - val_accuracy: 0.9903

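This is still implausibly high. Even when trained on a balanced sample, the model reaches about 99% validation accuracy, which suggests that at least one variable is giving the target away.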

### Over Sample Minority

In [25]:
ros = RandomOverSampler()

X_train_over, y_train_over = ros.fit_resample(X_train_sc, y_train)
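
Random over-sampling goes the other way: it keeps every row and repeats minority rows until the classes match. The sketch below is a simplified illustration with assumed names (`random_over_sample`, toy data), not imbalanced-learn's code.

```python
import random
from collections import Counter


def random_over_sample(rows, labels, seed=0):
    """Repeat rows of smaller classes until every class matches the largest."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    n_max = max(len(idxs) for idxs in by_class.values())
    keep = []
    for idxs in by_class.values():
        keep.extend(idxs)                                     # all original rows
        keep.extend(rng.choices(idxs, k=n_max - len(idxs)))   # duplicates, with replacement
    return [rows[i] for i in keep], [labels[i] for i in keep]


# Toy data (assumed): 95 majority rows, 5 minority rows
rows = [[float(i)] for i in range(100)]
labels = [1] * 95 + [0] * 5

rows_over, labels_over = random_over_sample(rows, labels)
unique_minority = {tuple(r) for r, l in zip(rows_over, labels_over) if l == 0}
print(Counter(labels_over), len(unique_minority))
```

The classes are balanced, but the minority class still contains only its original distinct rows, so no new information is added.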

In [26]:
model = Sequential()
model.add(Dense(17, input_shape=(X.shape[1],), activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(20, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(30, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(40, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(50, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(30, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(20, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='bce', 
              optimizer='adam', 
              metrics=['accuracy'])

early_stop = EarlyStopping(monitor='val_loss', patience=10, verbose=1)

history = model.fit(
    X_train_over,
    y_train_over,
    validation_data=(X_test_sc, y_test),
    epochs=200,
    callbacks=[early_stop]
)

Epoch 1/200
Epoch 2/200
Epoch 3/200
Epoch 4/200
Epoch 5/200
Epoch 6/200
Epoch 7/200
Epoch 8/200
Epoch 9/200
Epoch 10/200
Epoch 11/200
Epoch 12/200
Epoch 13/200
Epoch 00013: early stopping


#### Analysis 

Results: loss: 0.0081 - accuracy: 0.9983 - val_loss: 0.0173 - val_accuracy: 0.9983

This is still implausibly high. Over-sampling the minority class did not change the picture either, so a variable must be giving the target away.

### Over Sample Minority with SMOTE

In [27]:
smo = SMOTE()

X_train_smote, y_train_smote = smo.fit_resample(X_train_sc, y_train)
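
Unlike plain duplication, SMOTE creates synthetic minority rows by interpolating between a minority row and one of its nearest minority neighbours. The sketch below shows that single interpolation step on toy 2-D points. It is a simplified illustration under our own assumptions (`smote_like_point`, `k=2`, the toy points), not the imbalanced-learn implementation.

```python
import math
import random


def smote_like_point(sample, minority, k=2, rng=None):
    """Create one synthetic point between `sample` and a near minority neighbour."""
    rng = rng or random.Random(0)
    # k nearest minority neighbours of `sample` (excluding the sample itself)
    neighbours = sorted(
        (m for m in minority if m is not sample),
        key=lambda m: math.dist(sample, m),
    )[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # position along the segment, in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbour)]


# Toy minority-class points (assumed)
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]

synthetic = smote_like_point(minority[0], minority, k=2)
print(synthetic)
```

The synthetic point lies on the line segment between the chosen row and one of its two closest neighbours, so the distant point at (5, 5) is never used here.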

In [28]:
model = Sequential()
model.add(Dense(17, input_shape=(X.shape[1],), activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(20, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(30, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(40, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(50, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(30, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(20, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='bce', 
              optimizer='adam', 
              metrics=['accuracy'])

early_stop = EarlyStopping(monitor='val_loss', patience=10, verbose=1)

history = model.fit(
    X_train_smote,
    y_train_smote,
    validation_data=(X_test_sc, y_test),
    epochs=200,
    callbacks=[early_stop]
)

Epoch 1/200
Epoch 2/200
Epoch 3/200
Epoch 4/200
Epoch 5/200
Epoch 6/200
Epoch 7/200
Epoch 8/200
Epoch 9/200
Epoch 10/200
Epoch 11/200
Epoch 12/200
Epoch 13/200
Epoch 00013: early stopping


#### Analysis 

Results: loss: 0.0049 - accuracy: 0.9990 - val_loss: 0.0350 - val_accuracy: 0.9986

This is still implausibly high. SMOTE gives the same near-perfect accuracy, so a variable must be giving the target away.

# We are now very sure that a variable is giving the target away: none of our attempts to balance the dataset changed the result, and accuracy keeps approaching 100%

---

# Fix Inaccurate Accuracy

In [29]:
# df.drop(columns= [], inplace=True)
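
One simple way to hunt for the variable giving the target away is to rank features by the absolute value of their correlation with the target and inspect the top of the list. The sketch below does this in plain Python on toy data. The feature names `honest_feature` and `leaky_flag`, the values, and the helper `pearson` are all hypothetical. On the real DataFrame, something like `df.corr()['PRISDUM'].abs().sort_values()` should give a similar ranking, assuming all columns are numeric.

```python
from statistics import fmean


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


# Toy target and features (hypothetical names and values)
target = [1, 1, 1, 0, 1, 0, 1, 1]
features = {
    "honest_feature": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0],
    "leaky_flag": [0, 0, 0, 1, 0, 1, 0, 0],  # exact inverse of the target
}

# Highest absolute correlation first
ranked = sorted(
    ((abs(pearson(values, target)), name) for name, values in features.items()),
    reverse=True,
)
for score, name in ranked:
    print(f"{name}: {score:.3f}")
```

A feature with an absolute correlation near 1 is a strong candidate for the drop list above. Correlation only catches linear relationships, so a clean ranking does not prove there is no leak.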

## GridSearch

I adapted the code below from "GridSearch with keras" by Riley Dallas and Adi Bronshtein, which Eric Bayless showed me. It is almost identical to the original.

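**Note:**

There is an issue with `early_stop = EarlyStopping(monitor='val_loss', patience=10, verbose=1)`. The likely cause is that no validation data is passed to `fit` during the grid search, so `val_loss` is never reported and the callback has nothing to monitor.

On previous attempts I changed `val_loss` to `loss`.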
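
To see why the monitored metric matters, here is a simplified model of patience-based early stopping. It is our own sketch of the general logic, not the Keras source; the function `early_stopping_epoch` and the toy logs are assumptions.

```python
def early_stopping_epoch(epoch_logs, monitor="val_loss", patience=10):
    """Return the 1-based epoch where training would stop, or None if it never stops."""
    best = float("inf")
    wait = 0
    for epoch, logs in enumerate(epoch_logs, start=1):
        current = logs.get(monitor)
        if current is None:
            continue  # metric not reported, so there is nothing to compare
        if current < best:
            best, wait = current, 0  # improvement: reset the counter
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Toy logs (assumed): training loss only, improving for 5 epochs and then flat
logs_without_validation = [{"loss": v} for v in [0.5, 0.4, 0.3, 0.2, 0.1] + [0.1] * 10]

stop_on_val_loss = early_stopping_epoch(logs_without_validation, monitor="val_loss")
stop_on_loss = early_stopping_epoch(logs_without_validation, monitor="loss")
print(stop_on_val_loss, stop_on_loss)
```

With no `val_loss` in the logs the callback never triggers, while monitoring `loss` stops ten epochs after the last improvement. Monitoring training loss guards against wasted epochs but not against overfitting, so it is a compromise.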

In [None]:
# Take the number of hidden layers as an argument and loop to build that many layers
def model_fn_deep(hidden_neurons=32, hidden_layers=5, dropout=0.5):
    model=Sequential()
    
    for layer in range(hidden_layers):
        if layer == 0:
            model.add(Dense(hidden_neurons, input_shape=(X.shape[1],), activation='relu'))
            model.add(Dropout(dropout))
        else:
            model.add(Dense(hidden_neurons, activation='relu'))
            model.add(Dropout(dropout))
            
    model.add(Dense(1, activation='sigmoid'))
    
    model.compile(loss='bce', metrics=['acc'], optimizer='adam')
    
    return model

nn_deep = KerasClassifier(build_fn = model_fn_deep, batch_size=32, verbose=0)

# Monitor training loss: no validation data is passed during the grid search, so 'val_loss' is not reported
early_stop = EarlyStopping(monitor='loss', patience=10, verbose=1)

params_deep = {
    'hidden_neurons': [16,32,64,128,256,512,1024],
    'hidden_layers': [2,3,4,5,6,7,8,9,10],
    'dropout': [0.1,0.2,0.3,0.4,0.5],
    'epochs': [10,20,50,100],
    'callbacks': [[early_stop]]  # one candidate value, which is itself the list of callbacks
}

gs_deep = GridSearchCV(nn_deep, param_grid=params_deep, cv=5, n_jobs=-1)

gs_deep.fit(X_train_sc, y_train)
print(gs_deep.best_score_)
gs_deep.best_params_
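
Before running this search it is worth counting how many models it trains. The sketch below multiplies out the grid sizes from `params_deep` and the `cv=5` setting; the variable names are our own.

```python
from math import prod

# Same candidate values as params_deep (the single callbacks entry adds no combinations)
grid = {
    "hidden_neurons": [16, 32, 64, 128, 256, 512, 1024],
    "hidden_layers": [2, 3, 4, 5, 6, 7, 8, 9, 10],
    "dropout": [0.1, 0.2, 0.3, 0.4, 0.5],
    "epochs": [10, 20, 50, 100],
}
cv_folds = 5

n_candidates = prod(len(values) for values in grid.values())  # 7 * 9 * 5 * 4 = 1260
n_fits = n_candidates * cv_folds                               # 6300 cross-validation fits

print(n_candidates, n_fits)
```

That is 6,300 network trainings, plus one final refit on the full training set, some of them with 10 layers of 1,024 neurons for up to 100 epochs. Starting with a much coarser grid and narrowing the ranges in later searches, as planned at the top of this notebook, keeps the run time manageable.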