# Modeling

# Neural Network Classifier 

We first prepare the dataset by scaling the features and handling the class imbalance.

We then fit a basic neural network classifier to surface obvious issues and to record what a "first try" model looks like as a baseline beyond our null model.

After that first model, we fix any obvious issues and then iterate through grid searches to tune hyperparameters, guided by general knowledge of neural networks and by the ranges established in earlier grid searches.

## Imports

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn import metrics
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, GRU
from tensorflow.keras.layers import Dropout
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier

from imblearn.under_sampling import NearMiss, RandomUnderSampler
from imblearn.over_sampling import SMOTE, RandomOverSampler
from collections import Counter

## Read in Data

Import data and observe the basics

In [20]:
csv_file = "../Claire/data/drugs_2020_simply_imputed.csv"
df = pd.read_csv(csv_file)
print(df.shape)
df.head()

(16829, 64)


Unnamed: 0,accgdln,age,altdum,amttotal,casetype,citwhere,combdrg2,crimhist,disposit,district,...,typemony,typeoths,unit1,mwgt1,wgt1,xcrhissr,xfolsor,xmaxsor,xminsor,sentrnge
0,1.0,20.0,0,0,1.0,211.0,6.0,1.0,1,43,...,1.0,0,1.0,63560990.0,85104.433315,1.0,17.0,30.0,24.0,8.0
1,1.0,64.0,0,0,1.0,211.0,1.0,1.0,1,51,...,1.0,0,1.0,1193400.0,5967.0,3.0,27.0,108.0,87.0,0.0
2,1.0,28.0,0,0,1.0,211.0,3.0,1.0,1,48,...,1.0,0,2.0,2000000.0,2000.0,6.0,27.0,162.0,130.0,2.0
3,2.0,55.0,0,0,1.0,211.0,77.0,1.0,1,65,...,1.0,0,1.0,10300.0,4.12,5.0,13.0,37.0,30.0,0.0
4,1.0,30.0,0,0,1.0,211.0,6.0,1.0,1,87,...,1.0,0,1.0,169200.0,84.6,6.0,25.0,137.0,110.0,2.0


> **16829 rows and 64 columns**
>> **However, some of these columns will be dropped and one is our target column, `prisdum`**

In [21]:
df.columns

Index(['accgdln', 'age', 'altdum', 'amttotal', 'casetype', 'citwhere',
       'combdrg2', 'crimhist', 'disposit', 'district', 'drugmin', 'dsplea',
       'educatn', 'intdum', 'methmin', 'monrace', 'monsex', 'mweight',
       'newcit', 'newcnvtn', 'neweduc', 'newrace', 'nodrug', 'numdepen',
       'offguide', 'prisdum', 'probatn', 'probdum', 'quarter', 'reas1',
       'reas2', 'reas3', 'regsxmin', 'relmin', 'restdet1', 'restdum', 'safe',
       'safety', 'senspcap', 'sensplt0', 'sentimp', 'smax1', 'smin1',
       'sources', 'statmax', 'statmin', 'supermax', 'supermin', 'suprdum',
       'suprel', 'timservc', 'totchpts', 'totrest', 'totunit', 'typemony',
       'typeoths', 'unit1', 'mwgt1', 'wgt1', 'xcrhissr', 'xfolsor', 'xmaxsor',
       'xminsor', 'sentrnge'],
      dtype='object')

- Drop the index columns created from saving a DataFrame to a csv.
- Also drop the columns we have identified as either too correlated or not useful for our model.

# Some EDA

## Train Test Split

Set our `X` and `y`.

In [22]:
X = df.drop(columns=['prisdum'])
y = df['prisdum']

In [23]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

In [24]:
X_no = X.drop(columns=['age', 'newrace', 'monsex', 'monrace', 'neweduc', 'newcnvtn'])

In [25]:
X_no_train, X_no_test, y_no_train, y_no_test = train_test_split(X_no, y, stratify=y)
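Because the two `train_test_split` calls above are independent and unseeded, the full and no-demographic models are evaluated on different test rows, so their misclassified row numbers cannot be compared directly. A minimal sketch of one alternative, on toy data: split once with a fixed `random_state`, then drop the demographic columns from the resulting frames. The toy values, the `demographic_cols` subset, and the seed `42` are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data; column names mirror the notebook's.
toy = pd.DataFrame({
    "age":     [20, 64, 28, 55, 30, 41, 37, 52, 23, 46, 33, 60],
    "monsex":  [0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1],
    "wgt1":    [85.1, 5.9, 2.0, 4.1, 84.6, 12.0, 7.5, 3.3, 9.9, 1.2, 6.6, 8.8],
    "prisdum": [1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1],
})
demographic_cols = ["age", "monsex"]  # hypothetical subset for the toy frame

X = toy.drop(columns=["prisdum"])
y = toy["prisdum"]

# Split once, with a fixed seed, then derive the no-demographic view
# from the same rows so both models are evaluated on identical cases.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)
X_no_train = X_train.drop(columns=demographic_cols)
X_no_test = X_test.drop(columns=demographic_cols)

print(X_test.index.equals(X_no_test.index))
```

With a shared split, a single `y_train` / `y_test` pair serves both feature sets.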

## Scale Data for Neural Network Classifier

In [26]:
sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.transform(X_test)

In [27]:
sc = StandardScaler()
X_no_train_sc = sc.fit_transform(X_no_train)
X_no_test_sc = sc.transform(X_no_test)
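As a rough illustration of what the `fit_transform` / `transform` pair does, the sketch below reproduces the arithmetic with NumPy on toy arrays: the mean and standard deviation come from the training rows only and are then applied to both splits. The arrays are made up, and this is a simplified picture rather than scikit-learn's implementation.

```python
import numpy as np

# Toy train/test matrices (rows = cases, columns = features).
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
X_test = np.array([[2.5, 25.0], [5.0, 50.0]])

# "fit": learn per-column mean and standard deviation from the training rows only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)  # population std (ddof=0), which StandardScaler also uses

# "transform": apply the training statistics to both splits.
X_train_sc = (X_train - mean) / std
X_test_sc = (X_test - mean) / std

print(X_train_sc.mean(axis=0), X_train_sc.std(axis=0))
print(X_test_sc)
```

Fitting on the training split alone keeps information about the test rows out of the scaling step.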

### Null Model

In [28]:
y.value_counts(normalize=True)

1    0.955196
0    0.044804
Name: prisdum, dtype: float64

> We see that we have a very imbalanced dataset.

In [29]:
y_test.value_counts()

1    4019
0     189
Name: prisdum, dtype: int64
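The null model always predicts the majority class, so its accuracy is simply the majority share. A small sketch of that arithmetic, using the test-split counts printed above (the variable names are illustrative):

```python
# Class counts copied from the y_test.value_counts() output above.
test_counts = {1: 4019, 0: 189}

n_test = sum(test_counts.values())
majority_class = max(test_counts, key=test_counts.get)

# Accuracy of a model that always predicts the majority class.
null_accuracy = test_counts[majority_class] / n_test

print(majority_class, n_test, round(null_accuracy, 4))
```

Any model accuracy reported later should be read against this figure rather than against 50%.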

# Model on Imbalanced Data

In [30]:
model = Sequential()
model.add(Dense(64, input_shape=(X_train_sc.shape[1],), activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='bce', 
              optimizer='adam', 
              metrics=['accuracy'])

early_stop = EarlyStopping(monitor='val_loss', patience=10, verbose=1)

history = model.fit(
    X_train_sc,
    y_train,
    validation_data=(X_test_sc, y_test),
    epochs=200,
    callbacks=[early_stop]
)

Epoch 1/200
Epoch 2/200
Epoch 3/200
Epoch 4/200
Epoch 5/200
Epoch 6/200
Epoch 7/200
Epoch 8/200
Epoch 9/200
Epoch 10/200
Epoch 11/200
Epoch 12/200
Epoch 13/200
Epoch 00013: early stopping


#### Analysis:
Results: loss: 0.0073 - accuracy: 0.9976 - val_loss: 0.0070 - val_accuracy: 0.9988

As expected, a very accurate model. However, a validation accuracy of 0.9988 is only about four percentage points above the null accuracy of 0.955, so accuracy alone tells us little here and the result is likely still shaped by that imbalance.

In [31]:
preds = np.round(model.predict(X_test_sc),0)
tn, fp, fn, tp = metrics.confusion_matrix(y_test, preds).ravel()


df = pd.DataFrame(metrics.confusion_matrix(y_test, preds), 
                  columns=['predicted_no_prison', 'predicted_prison'], 
                  index=['actual_no_prison', 'actual_prison']
                 )
df

Unnamed: 0,predicted_no_prison,predicted_prison
actual_no_prison,186,3
actual_prison,5,4014
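With a 95/5 class split, per-class rates say more than overall accuracy. The sketch below derives them from the four cells of the confusion matrix above; the metric choices (sensitivity, specificity, balanced accuracy) are a suggestion, not something the original analysis reports.

```python
# Cell values copied from the confusion matrix above
# (rows = actual, columns = predicted; class 0 = no prison, class 1 = prison).
tn, fp, fn, tp = 186, 3, 5, 4014

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)      # share of actual prison cases caught
specificity = tn / (tn + fp)      # share of actual no-prison cases caught
balanced_accuracy = (sensitivity + specificity) / 2

for name, value in [
    ("accuracy", accuracy),
    ("sensitivity", sensitivity),
    ("specificity", specificity),
    ("balanced accuracy", balanced_accuracy),
]:
    print(f"{name}: {value:.4f}")
```

Specificity is the number to watch here, since it is computed only on the 189 minority cases.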


In [32]:
misclass1 = []
for row_index, (input, prediction, label) in enumerate(zip (X_test_sc, preds, y_test)):
    if prediction != label:
        misclass1.append(row_index)
        print('Row', row_index, 'has been classified as ', prediction, 'and should be ', label)

Row 619 has been classified as  [0.] and should be  1
Row 1104 has been classified as  [0.] and should be  1
Row 1274 has been classified as  [1.] and should be  0
Row 2425 has been classified as  [0.] and should be  1
Row 2915 has been classified as  [1.] and should be  0
Row 3106 has been classified as  [1.] and should be  0
Row 3983 has been classified as  [0.] and should be  1
Row 4065 has been classified as  [0.] and should be  1


# Make dataframe out of misclassification list

In [38]:
X.iloc[misclass1][['age', 'newrace', 'monsex', 'monrace', 'neweduc', 'newcnvtn', 'educatn']]

Unnamed: 0,age,newrace,monsex,monrace,neweduc,newcnvtn,educatn
619,44.0,2.0,0.0,2,3.0,0,21.0
1104,48.0,1.0,1.0,1,5.0,0,23.0
1274,34.0,2.0,0.0,2,5.0,0,34.0
2425,38.0,2.0,0.0,2,6.0,0,16.0
2915,22.0,3.0,1.0,1,3.0,0,12.0
3106,59.0,2.0,0.0,2,1.0,0,6.0
3983,27.0,1.0,1.0,1,5.0,0,34.0
4065,50.0,3.0,0.0,1,5.0,1,34.0
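One caveat on the lookup above: the row numbers collected by `enumerate` are positions within the test split, so using them with `X.iloc[...]` selects rows of the full `X` that happen to sit at those positions, which are generally not the misclassified cases. A minimal sketch on toy data of selecting from the test split instead; the toy index labels and values are made up.

```python
import numpy as np
import pandas as pd

# Toy test split whose index labels differ from row positions,
# as happens after train_test_split shuffles the rows.
X_test = pd.DataFrame({"age": [44.0, 48.0, 34.0, 38.0]}, index=[907, 12, 455, 3031])
y_test = pd.Series([1, 1, 0, 1], index=X_test.index)
preds = np.array([[1.0], [0.0], [1.0], [1.0]])  # same (n, 1) shape as np.round(model.predict(...))

# Positions (0-based, within the test split) where prediction and label disagree.
positions = np.flatnonzero(preds.ravel() != y_test.to_numpy())

# Select by position from the test split itself, not from the full X.
misclassified = X_test.iloc[positions]

print(positions.tolist())
print(misclassified.index.tolist())
```

The index labels of `misclassified` are the original row labels, which is what makes comparisons across models meaningful when the models share a test split.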


**Imbalanced Model Misclassified Individuals**

We will use this as a baseline against other misclassifications in the attempt to see if there is a pattern.

Indices for misclassifications are: 619, 1104, 1274, 2425, 2915, 3106, 3983, 4065

## Without Demographic information

In [40]:
model = Sequential()
model.add(Dense(64, input_shape=(X_no_train_sc.shape[1],), activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='bce', 
              optimizer='adam', 
              metrics=['accuracy'])

early_stop = EarlyStopping(monitor='val_loss', patience=10, verbose=1)

history = model.fit(
    X_no_train_sc,
    y_no_train,
    validation_data=(X_no_test_sc, y_no_test),
    epochs=200,
    callbacks=[early_stop]
)

Epoch 1/200
Epoch 2/200
Epoch 3/200
Epoch 4/200
Epoch 5/200
Epoch 6/200
Epoch 7/200
Epoch 8/200
Epoch 9/200
Epoch 10/200
Epoch 11/200
Epoch 12/200
Epoch 13/200
Epoch 14/200
Epoch 00014: early stopping


#### Analysis:
Results: loss: 0.2942 - accuracy: 0.9568 - val_loss: 0.1175 - val_accuracy: 0.9665

As expected, a very accurate model. However, not a massive improvement over our null model and therefore likely suffering because of that imbalance.
> Interestingly, we do see a drop in accuracy after removing demographic information. However, the model is still suffering from the class imbalance and the accuracy was swinging from 95-99%.

In [41]:
preds = np.round(model.predict(X_no_test_sc),0)
tn, fp, fn, tp = metrics.confusion_matrix(y_no_test, preds).ravel()



df = pd.DataFrame(metrics.confusion_matrix(y_no_test, preds), 
                  columns=['predicted_no_prison', 'predicted_prison'], 
                  index=['actual_no_prison', 'actual_prison']
                 )
df

Unnamed: 0,predicted_no_prison,predicted_prison
actual_no_prison,182,7
actual_prison,2,4017


In [42]:
misclass1_no = []
for row_index, (input, prediction, label) in enumerate(zip (X_no_test_sc, preds, y_no_test)):
    if prediction != label:
        misclass1_no.append(row_index)
        print('Row', row_index, 'has been classified as ', prediction, 'and should be ', label)

Row 301 has been classified as  [1.] and should be  0
Row 927 has been classified as  [1.] and should be  0
Row 1103 has been classified as  [1.] and should be  0
Row 1648 has been classified as  [0.] and should be  1
Row 2621 has been classified as  [1.] and should be  0
Row 2932 has been classified as  [1.] and should be  0
Row 2949 has been classified as  [1.] and should be  0
Row 3362 has been classified as  [0.] and should be  1
Row 3878 has been classified as  [1.] and should be  0


In [43]:
X.iloc[misclass1_no][['age', 'newrace', 'monsex', 'monrace', 'neweduc', 'newcnvtn', 'educatn']]

Unnamed: 0,age,newrace,monsex,monrace,neweduc,newcnvtn,educatn
301,42.0,1.0,0.0,1,3.0,0,12.0
927,25.0,6.0,0.0,3,3.0,0,12.0
1103,35.0,2.0,0.0,2,3.0,0,33.0
1648,28.0,2.0,0.0,2,1.0,0,32.0
2621,37.0,2.0,0.0,2,1.0,0,10.0
2932,28.0,3.0,0.0,1,1.0,0,32.0
2949,30.0,2.0,0.0,2,3.0,0,21.0
3362,34.0,2.0,0.0,2,1.0,0,9.0
3878,25.0,1.0,1.0,1,3.0,0,12.0


**Imbalanced Model, No Demographic, Misclassified Individuals**

We will use this as a baseline against other misclassifications in the attempt to see if there is a pattern.

Indices for misclassifications are: 301, 927, 1103, 1648, 2621, 2932, 2949, 3362, 3878

---

# Balance Imbalanced Data

### Under Sample Majority

In [19]:
nm = RandomUnderSampler()
X_train_under, y_train_under = nm.fit_resample(X_train_sc, y_train)

In [20]:
y_train_under.value_counts(normalize=True)

0    0.5
1    0.5
Name: prisdum, dtype: float64
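Conceptually, random under-sampling keeps every minority row and a random, equally sized subset of the majority rows. The NumPy sketch below shows that idea on toy labels; it is an approximation for intuition, not imbalanced-learn's code, and the seed and array sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the draw is repeatable

# Toy imbalanced labels: 95 majority (1) and 5 minority (0) cases.
y = np.array([1] * 95 + [0] * 5)
X = np.arange(len(y)).reshape(-1, 1).astype(float)

minority_idx = np.flatnonzero(y == 0)
majority_idx = np.flatnonzero(y == 1)

# Keep every minority row and an equally sized random draw of majority rows.
kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
keep = np.sort(np.concatenate([minority_idx, kept_majority]))

X_under, y_under = X[keep], y[keep]
print(np.bincount(y_under))
```

The cost is visible in the sizes: 90 of the 95 majority rows are discarded, which is why the under-sampled training set is much smaller than the original.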

In [21]:
model = Sequential()
model.add(Dense(64, input_shape=(X_train_under.shape[1],), activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='bce', 
              optimizer='adam', 
              metrics=['accuracy'])

early_stop = EarlyStopping(monitor='val_loss', patience=10, verbose=1)

history = model.fit(
    X_train_under,
    y_train_under,
    validation_data=(X_test_sc, y_test),
    epochs=200,
    callbacks=[early_stop]
)

Epoch 1/200
Epoch 2/200
Epoch 3/200
Epoch 4/200
Epoch 5/200
Epoch 6/200
Epoch 7/200
Epoch 8/200
Epoch 9/200
Epoch 10/200
Epoch 11/200
Epoch 12/200
Epoch 13/200
Epoch 14/200
Epoch 15/200
Epoch 16/200
Epoch 17/200
Epoch 18/200
Epoch 19/200
Epoch 20/200
Epoch 21/200
Epoch 22/200
Epoch 23/200
Epoch 24/200
Epoch 25/200
Epoch 26/200
Epoch 00026: early stopping


#### Analysis 

Results: loss: 0.0301 - accuracy: 0.9876 - val_loss: 0.1276 - val_accuracy: 0.9903

This is still suspiciously high for a model trained on balanced data: validation accuracy is 0.9903. There must be a variable giving the target away.
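One quick way to look for a variable that gives the target away is to rank features by their absolute correlation with the target and inspect anything close to 1. The sketch below does this on a toy frame; the `leaky` and `noise` columns and the 0.9 cutoff are assumptions chosen for illustration, and a linear correlation screen will not catch every kind of leakage.

```python
import pandas as pd

# Toy frame: `leaky` is derived directly from the target, `noise` is not.
toy = pd.DataFrame({
    "prisdum": [1, 1, 0, 1, 0, 1, 1, 0],
    "leaky":   [0, 0, 1, 0, 1, 0, 0, 1],   # hypothetical column, exactly 1 - prisdum
    "noise":   [3, 1, 2, 2, 3, 1, 3, 1],   # hypothetical unrelated column
})

X = toy.drop(columns=["prisdum"])
y = toy["prisdum"]

# Absolute Pearson correlation of every feature with the target, largest first.
target_corr = X.corrwith(y).abs().sort_values(ascending=False)
print(target_corr)

# Flag features whose correlation with the target is suspiciously close to 1.
suspects = target_corr[target_corr > 0.9].index.tolist()
print(suspects)
```

On the real data, columns that describe the sentence itself rather than the case would be the first candidates to check.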

In [22]:
preds = np.round(model.predict(X_test_sc),0)
tn, fp, fn, tp = metrics.confusion_matrix(y_test, preds).ravel()



df = pd.DataFrame(metrics.confusion_matrix(y_test, preds), 
                  columns=['predicted_no_prison', 'predicted_prison'], 
                  index=['actual_no_prison', 'actual_prison']
                 )
df

Unnamed: 0,predicted_no_prison,predicted_prison
actual_no_prison,187,2
actual_prison,30,3989


In [24]:
for row_index, (input, prediction, label) in enumerate(zip (X_test_sc, preds, y_test)):
    if prediction != label:
        print('Row', row_index, 'has been classified as ', prediction, 'and should be ', label)

Row 313 has been classified as  [0.] and should be  1
Row 624 has been classified as  [1.] and should be  0
Row 774 has been classified as  [0.] and should be  1
Row 785 has been classified as  [0.] and should be  1
Row 890 has been classified as  [0.] and should be  1
Row 953 has been classified as  [0.] and should be  1
Row 1191 has been classified as  [0.] and should be  1
Row 1216 has been classified as  [0.] and should be  1
Row 1370 has been classified as  [0.] and should be  1
Row 1633 has been classified as  [0.] and should be  1
Row 1665 has been classified as  [0.] and should be  1
Row 1770 has been classified as  [0.] and should be  1
Row 2021 has been classified as  [0.] and should be  1
Row 2101 has been classified as  [0.] and should be  1
Row 2214 has been classified as  [0.] and should be  1
Row 2230 has been classified as  [0.] and should be  1
Row 2296 has been classified as  [0.] and should be  1
Row 2476 has been classified as  [0.] and should be  1
Row 2547 has bee

## Without Demographic information

In [None]:
# Use the no-demographic training split defined earlier (X_no_train_sc, y_no_train)
nm = RandomUnderSampler()
X_train_sc_no_dem_under, y_train_no_dem_under = nm.fit_resample(X_no_train_sc, y_no_train)

In [16]:
model = Sequential()
model.add(Dense(64, input_shape=(X_train_sc_no_dem_under.shape[1],), activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='bce', 
              optimizer='adam', 
              metrics=['accuracy'])

early_stop = EarlyStopping(monitor='val_loss', patience=10, verbose=1)

history = model.fit(
    X_train_sc_no_dem_under,
    y_train_no_dem_under,
    validation_data=(X_no_test_sc, y_no_test),
    epochs=200,
    callbacks=[early_stop]
)

Epoch 1/200
Epoch 2/200
Epoch 3/200
Epoch 4/200
Epoch 5/200
Epoch 6/200
Epoch 7/200
Epoch 8/200
Epoch 9/200
Epoch 10/200
Epoch 11/200
Epoch 12/200
Epoch 13/200
Epoch 14/200
Epoch 15/200
Epoch 16/200
Epoch 17/200
Epoch 18/200
Epoch 19/200
Epoch 20/200
Epoch 21/200
Epoch 22/200
Epoch 23/200
Epoch 24/200
Epoch 25/200
Epoch 26/200
Epoch 27/200
Epoch 28/200
Epoch 29/200
Epoch 30/200
Epoch 31/200
Epoch 32/200
Epoch 33/200
Epoch 34/200
Epoch 35/200
Epoch 36/200
Epoch 37/200
Epoch 38/200
Epoch 39/200
Epoch 40/200
Epoch 41/200
Epoch 42/200
Epoch 43/200
Epoch 44/200
Epoch 45/200
Epoch 46/200
Epoch 47/200
Epoch 48/200
Epoch 49/200
Epoch 50/200
Epoch 51/200
Epoch 52/200
Epoch 53/200
Epoch 00053: early stopping


#### Analysis:
Results: loss: 0.2942 - accuracy: 0.9568 - val_loss: 0.1175 - val_accuracy: 0.9665

This model behaves differently from the imbalanced runs. The confusion matrix below gives a test accuracy of (9 + 3819) / 4208 ≈ 0.910, which is below the null accuracy of 0.955, and only 9 of the 189 no-prison cases are predicted correctly.
> Interestingly, we do see a drop in accuracy after removing demographic information: the under-sampled model that keeps it scores (187 + 3989) / 4208 ≈ 0.992 on the test split.

In [17]:
# Evaluate on the no-demographic test split defined earlier (X_no_test_sc, y_no_test)
preds = np.round(model.predict(X_no_test_sc),0)
tn, fp, fn, tp = metrics.confusion_matrix(y_no_test, preds).ravel()

df = pd.DataFrame(metrics.confusion_matrix(y_no_test, preds), 
                  columns=['predicted_no_prison', 'predicted_prison'], 
                  index=['actual_no_prison', 'actual_prison']
                 )
df

Unnamed: 0,predicted_no_prison,predicted_prison
actual_no_prison,9,180
actual_prison,200,3819


In [18]:
for row_index, (input, prediction, label) in enumerate(zip(X_no_test_sc, preds, y_no_test)):
    if prediction != label:
        print('Row', row_index, 'has been classified as ', prediction, 'and should be ', label)

Row 4 has been classified as  [1.] and should be  0
Row 10 has been classified as  [1.] and should be  0
Row 17 has been classified as  [0.] and should be  1
Row 22 has been classified as  [1.] and should be  0
Row 27 has been classified as  [1.] and should be  0
Row 28 has been classified as  [0.] and should be  1
Row 31 has been classified as  [0.] and should be  1
Row 37 has been classified as  [0.] and should be  1
Row 52 has been classified as  [0.] and should be  1
Row 61 has been classified as  [1.] and should be  0
Row 104 has been classified as  [0.] and should be  1
Row 106 has been classified as  [1.] and should be  0
Row 107 has been classified as  [1.] and should be  0
Row 112 has been classified as  [0.] and should be  1
Row 122 has been classified as  [1.] and should be  0
Row 137 has been classified as  [1.] and should be  0
Row 150 has been classified as  [1.] and should be  0
Row 155 has been classified as  [1.] and should be  0
Row 156 has been classified as  [0.] an

Row 3185 has been classified as  [1.] and should be  0
Row 3187 has been classified as  [0.] and should be  1
Row 3209 has been classified as  [0.] and should be  1
Row 3214 has been classified as  [0.] and should be  1
Row 3238 has been classified as  [0.] and should be  1
Row 3265 has been classified as  [1.] and should be  0
Row 3280 has been classified as  [1.] and should be  0
Row 3282 has been classified as  [0.] and should be  1
Row 3291 has been classified as  [0.] and should be  1
Row 3312 has been classified as  [0.] and should be  1
Row 3323 has been classified as  [0.] and should be  1
Row 3336 has been classified as  [1.] and should be  0
Row 3347 has been classified as  [0.] and should be  1
Row 3406 has been classified as  [1.] and should be  0
Row 3434 has been classified as  [0.] and should be  1
Row 3443 has been classified as  [0.] and should be  1
Row 3458 has been classified as  [0.] and should be  1
Row 3483 has been classified as  [0.] and should be  1
Row 3488 h

### Over Sample Minority

In [25]:
ros = RandomOverSampler()

X_train_over, y_train_over = ros.fit_resample(X_train_sc, y_train)

In [27]:
model = Sequential()
model.add(Dense(64, input_shape=(X_train_over.shape[1],), activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='bce', 
              optimizer='adam', 
              metrics=['accuracy'])

early_stop = EarlyStopping(monitor='val_loss', patience=10, verbose=1)

history = model.fit(
    X_train_over,
    y_train_over,
    validation_data=(X_test_sc, y_test),
    epochs=200,
    callbacks=[early_stop]
)

Epoch 1/200
Epoch 2/200
Epoch 3/200
Epoch 4/200
Epoch 5/200
Epoch 6/200
Epoch 7/200
Epoch 8/200
Epoch 9/200
Epoch 10/200
Epoch 11/200
Epoch 12/200
Epoch 13/200
Epoch 14/200
Epoch 15/200
Epoch 16/200
Epoch 17/200
Epoch 18/200
Epoch 19/200
Epoch 20/200
Epoch 00020: early stopping


#### Analysis 

Results: loss: 0.0081 - accuracy: 0.9983 - val_loss: 0.0173 - val_accuracy: 0.9983

This is still suspiciously high for a model trained on over-sampled, balanced data: validation accuracy is 0.9983. There must be a variable giving the target away.

In [29]:
preds = np.round(model.predict(X_test_sc),0)
tn, fp, fn, tp = metrics.confusion_matrix(y_test, preds).ravel()



df = pd.DataFrame(metrics.confusion_matrix(y_test, preds), 
                  columns=['predicted_no_prison', 'predicted_prison'], 
                  index=['actual_no_prison', 'actual_prison']
                 )
df

Unnamed: 0,predicted_no_prison,predicted_prison
actual_no_prison,186,3
actual_prison,1,4018


In [30]:
for row_index, (input, prediction, label) in enumerate(zip(X_test_sc, preds, y_test)):
    if prediction != label:
        print('Row', row_index, 'has been classified as ', prediction, 'and should be ', label)

Row 624 has been classified as  [1.] and should be  0
Row 2108 has been classified as  [1.] and should be  0
Row 2476 has been classified as  [0.] and should be  1
Row 2696 has been classified as  [1.] and should be  0


## Without Demographic information

In [37]:
ros = RandomOverSampler()

X_train_no_dem_over, y_train_no_dem_over = ros.fit_resample(X_no_train_sc, y_no_train)

In [39]:
model = Sequential()
model.add(Dense(64, input_shape=(X_train_no_dem_over.shape[1],), activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='bce', 
              optimizer='adam', 
              metrics=['accuracy'])

early_stop = EarlyStopping(monitor='val_loss', patience=10, verbose=1)

history = model.fit(
    X_train_no_dem_over,
    y_train_no_dem_over,
    validation_data=(X_no_test_sc, y_no_test),
    epochs=200,
    callbacks=[early_stop]
)

Epoch 1/200
Epoch 2/200
Epoch 3/200
Epoch 4/200
Epoch 5/200
Epoch 6/200
Epoch 7/200
Epoch 8/200
Epoch 9/200
Epoch 10/200
Epoch 11/200
Epoch 12/200
Epoch 13/200
Epoch 14/200
Epoch 15/200
Epoch 00015: early stopping


#### Analysis:
Results: loss: 0.2942 - accuracy: 0.9568 - val_loss: 0.1175 - val_accuracy: 0.9665

Again a very accurate model: the confusion matrix below gives a test accuracy of (186 + 4016) / 4208 ≈ 0.9986. Because the training data are balanced here, the class imbalance no longer explains a score this high.
> Removing demographic information barely changes the result: the over-sampled model that keeps it scores (186 + 4018) / 4208 ≈ 0.9990 on the test split. This again points to a variable giving the target away.

In [40]:
# Evaluate on the no-demographic test split defined earlier (X_no_test_sc, y_no_test)
preds = np.round(model.predict(X_no_test_sc),0)
tn, fp, fn, tp = metrics.confusion_matrix(y_no_test, preds).ravel()

df = pd.DataFrame(metrics.confusion_matrix(y_no_test, preds), 
                  columns=['predicted_no_prison', 'predicted_prison'], 
                  index=['actual_no_prison', 'actual_prison']
                 )
df

Unnamed: 0,predicted_no_prison,predicted_prison
actual_no_prison,186,3
actual_prison,3,4016


In [18]:
for row_index, (input, prediction, label) in enumerate(zip(X_no_test_sc, preds, y_no_test)):
    if prediction != label:
        print('Row', row_index, 'has been classified as ', prediction, 'and should be ', label)


### SMOTE (Synthetic Minority Over-sampling Technique)

In [32]:
smo = SMOTE()

X_train_smote, y_train_smote = smo.fit_resample(X_train_sc, y_train)
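Unlike random over-sampling, which repeats existing minority rows, SMOTE creates new minority rows by interpolating between a minority point and one of its nearest minority neighbours. The sketch below shows that single interpolation step with NumPy on toy points; it is a simplified illustration of the idea, not imbalanced-learn's implementation, and `synth_sample` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy minority-class points in a 2-feature space.
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])

def synth_sample(points, i, k, rng):
    """Create one synthetic point from points[i] and one of its k nearest neighbours."""
    distances = np.linalg.norm(points - points[i], axis=1)
    neighbours = np.argsort(distances)[1:k + 1]   # skip position 0, the point itself
    j = rng.choice(neighbours)
    lam = rng.random()                            # interpolation weight in [0, 1)
    return points[i] + lam * (points[j] - points[i]), j, lam

new_point, j, lam = synth_sample(minority, i=0, k=2, rng=rng)
print(new_point, j, lam)
```

The synthetic point always lies on the line segment between the chosen point and its neighbour, so it stays inside the region the minority class already occupies.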

In [33]:
model = Sequential()
model.add(Dense(64, input_shape=(X_train_smote.shape[1],), activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='bce', 
              optimizer='adam', 
              metrics=['accuracy'])

early_stop = EarlyStopping(monitor='val_loss', patience=10, verbose=1)

history = model.fit(
    X_train_smote,
    y_train_smote,
    validation_data=(X_test_sc, y_test),
    epochs=200,
    callbacks=[early_stop]
)

Epoch 1/200
Epoch 2/200
Epoch 3/200
Epoch 4/200
Epoch 5/200
Epoch 6/200
Epoch 7/200
Epoch 8/200
Epoch 9/200
Epoch 10/200
Epoch 11/200
Epoch 12/200
Epoch 13/200
Epoch 14/200
Epoch 15/200
Epoch 16/200
Epoch 17/200
Epoch 18/200
Epoch 00018: early stopping


#### Analysis 

Results: loss: 0.0049 - accuracy: 0.9990 - val_loss: 0.0350 - val_accuracy: 0.9986

This is still suspiciously high for a model trained on SMOTE-balanced data: validation accuracy is 0.9986. There must be a variable giving the target away.

In [34]:
preds = np.round(model.predict(X_test_sc),0)
tn, fp, fn, tp = metrics.confusion_matrix(y_test, preds).ravel()



df = pd.DataFrame(metrics.confusion_matrix(y_test, preds), 
                  columns=['predicted_no_prison', 'predicted_prison'], 
                  index=['actual_no_prison', 'actual_prison']
                 )
df

Unnamed: 0,predicted_no_prison,predicted_prison
actual_no_prison,186,3
actual_prison,2,4017


In [35]:
for row_index, (input, prediction, label) in enumerate(zip(X_test_sc, preds, y_test)):
    if prediction != label:
        print('Row', row_index, 'has been classified as ', prediction, 'and should be ', label)

Row 624 has been classified as  [1.] and should be  0
Row 2108 has been classified as  [1.] and should be  0
Row 2476 has been classified as  [0.] and should be  1
Row 2696 has been classified as  [1.] and should be  0
Row 2899 has been classified as  [0.] and should be  1



---

# Fix Inaccurate Accuracy

In [None]:
# df.drop(columns= [], inplace=True)

## GridSearch

The code below is adapted from "GridSearch with keras" by Riley Dallas and Adi Bronshtein, shown to me by Eric Bayless; it is almost identical to the original.

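**Note:**

There is an issue with ```early_stop = EarlyStopping(monitor='val_loss', patience=10, verbose=1)```: no validation data is passed to `fit` during the grid search, so there is most likely no ```val_loss``` for the callback to monitor.

On previous attempts I changed ```val_loss``` to ```loss```.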
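For reference, the patience rule behind early stopping can be sketched in a few lines of plain Python: training stops once the monitored quantity has failed to improve for `patience` consecutive epochs. This is a simplified illustration, not Keras's code; `early_stop_epoch` and the loss values are hypothetical.

```python
def early_stop_epoch(losses, patience, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None if it never does."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best - min_delta:
            best = loss      # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1        # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

# Hypothetical per-epoch training losses: improvement stalls after epoch 3.
losses = [0.90, 0.50, 0.30, 0.31, 0.30, 0.32, 0.33]
print(early_stop_epoch(losses, patience=3))
```

Whichever quantity is monitored has to be logged every epoch; a rule like this has nothing to compare if the monitored value is missing.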

In [None]:
# Add an argument for the number of layers to the function (and loop through it)

# break  # a bare `break` outside a loop is a SyntaxError; left commented out so the cell can run

def model_fn_deep(hidden_neurons=32, hidden_layers=5, dropout=0.5):
    model=Sequential()
    
    for layer in range(hidden_layers):
        if layer == 0:
            model.add(Dense(hidden_neurons, input_shape=(X.shape[1],), activation='relu'))
            model.add(Dropout(dropout))
        else:
            model.add(Dense(hidden_neurons, activation='relu'))
            model.add(Dropout(dropout))
            
    model.add(Dense(1, activation='sigmoid'))
    
    model.compile(loss='bce', metrics=['acc'], optimizer='adam')
    
    return model

nn_deep = KerasClassifier(build_fn = model_fn_deep, batch_size=32, verbose=0)

early_stop = EarlyStopping(monitor='loss', patience=10, verbose=1)  # no validation data in the grid search, so monitor training loss

params_deep = {
    'hidden_neurons': [16,32,64,128,256,512,1024],
    'hidden_layers': [2,3,4,5,6,7,8,9,10],
    'dropout': [0.1,0.2,0.3,0.4,0.5],
    'epochs': [10,20,50,100],
    'callbacks': [early_stop]
}

gs_deep = GridSearchCV(nn_deep, param_grid=params_deep, cv=5, n_jobs=-1)

gs_deep.fit(X_train_sc, y_train)
print(gs_deep.best_score_)
gs_deep.best_params_
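Before launching a search like this, it helps to count how many models it will train. The sketch below multiplies out the candidate lists from `params_deep` and the fold count; the `grid` dictionary restates those lists, with a placeholder string standing in for the callback object.

```python
from math import prod

# Same candidate lists as params_deep above (callbacks has a single candidate).
grid = {
    "hidden_neurons": [16, 32, 64, 128, 256, 512, 1024],
    "hidden_layers": [2, 3, 4, 5, 6, 7, 8, 9, 10],
    "dropout": [0.1, 0.2, 0.3, 0.4, 0.5],
    "epochs": [10, 20, 50, 100],
    "callbacks": ["early_stop"],  # placeholder string standing in for the callback object
}
cv_folds = 5

n_candidates = prod(len(values) for values in grid.values())
n_fits = n_candidates * cv_folds  # GridSearchCV also refits once on the full data

print(n_candidates, n_fits)
```

At this size, narrowing the ranges in stages, as described at the top of the notebook, is likely to be far cheaper than one exhaustive pass.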