## Keras Deep Neural Network

Author = Grant Redfield

Goal: This script will be used to predict a "1" or a "0" in the test dataset "exercise_40_test.csv" based on 100 different features. I will use a neural network for this prediction

Data used: exercise_40_train.csv, exercise_40_test.csv



### Import Libraries

In [1]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
import numpy as np
import tensorflow as tf
from tensorflow import keras
from keras import regularizers
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from keras import metrics
import gc
import os
import pickle
import keras_tuner   
from sklearn.preprocessing import OneHotEncoder
tf.random.set_seed(42)
pd.options.display.max_rows = 400
import warnings
from keras.models import Model, load_model
from keras import backend as K
from keras.layers import Input, Dense, Bidirectional
from keras.layers import Dropout
warnings.simplefilter(action='ignore', category=FutureWarning)

In [2]:
SF_Train = pd.read_csv("C:\\Users\\Grant\\Desktop\\Data_Science\\StateFarm\\exercise_40_train.csv")

### PreProcessing Function + Feature Engineering

1. Day of the week had multiple variations of the same day (ex. Sat + Saturday). Normalized this

2. Removed Characters like "$" and "%" to create numeric columns

3. Dropped Unnecesary Column "x39" which contained the same value throughout the dataframe

4. Added New Column considering what part of the week it is (Weekday or Weekend)

5. Added New Column considering what time of year it is (Season)

In [3]:
def FixPandas_DF(df):
    map_replace = {
"Sat" : "Saturday" ,
"Saturday" : "Saturday" ,
"Sun" :"Sunday" ,
"Mon" : "Monday" ,
"Monday" : "Monday",
"Tue" : "Tuesday",
"Wed": "Wednesday" ,
"Wednesday": "Wednesday" ,
"Thur" : "Thursday",
"Thursday" : "Thursday",
"Fri": "Friday" ,
"Friday" : "Friday"
}
    df['x3'] = df['x3'].map(map_replace).fillna(SF_Train['x3']) #Point number 1
    df['x19'] = df['x19'].str.replace('$', '').astype(float) # Point Number 2
    df['x7'] = df['x7'].str.replace('%', '').astype(float) # Point Number 2
    df = df.drop(['x39'], axis=1) # Point Number 3
   
    # Point Number 4
    conditions = [
    (df['x3'] == 'Saturday') | ( df['x3'] == 'Sunday'),
    (df['x3'] == "Monday") | (df['x3'] == "Tuesday") | (df['x3'] == "Wednesday" )| (df['x3'] == "Thursday" )| (df['x3'] == "Friday" )]
    values = ['Weekend', 'Weekday']
    df['Part_Of_Week'] = np.select(conditions, values)
    conditions = [
    (df['x60'] == 'March') | ( df['x60'] == 'April') | ( df['x60'] == 'April'),
    (df['x60'] == 'June') | ( df['x60'] == 'July') | ( df['x60'] == 'August'),
    (df['x60'] == 'September') | ( df['x60'] == 'October') | ( df['x60'] == 'November'),
    (df['x60'] == 'December') | ( df['x60'] == 'January') | ( df['x60'] == 'February')]
    values = ['Spring', 'Summer', 'Autumn', 'Winter']
    df['Season'] = np.select(conditions, values)
    return df

In [4]:
def setScaler(x_data, x_values_to_scale):
    scaler = StandardScaler()
    scaler.fit(x_data[x_values_to_scale])
    return scaler


def setOneHotEncoder(x_data, categorical_cols):
    
    transformer = make_column_transformer((OneHotEncoder( handle_unknown='ignore'), 
                                       categorical_cols), remainder='passthrough')
    one_hot_transformer = transformer.fit(x_data)

    with open("C:\\Users\\Grant\\Desktop\\Data_Science\\StateFarm\\SF_encoder", "wb") as f: 
        pickle.dump(one_hot_transformer, f)
    
    return one_hot_transformer


def TransformData(x_data, y_data, numeric_scaler, numeric_cols, categorical_cols, column_means, one_hot_transformer, add_noise):
    np.random.seed(42)
    
    for col in column_means.columns:
        x_data[col] = x_data[col].fillna(value = column_means[col][0])
    
    x_data[numeric_cols] = numeric_scaler.transform(x_data[numeric_cols])
    
    if add_noise == "Yes":
        for col in column_means.columns:
            x_data[col] = x_data[col].apply(lambda x: x + np.random.normal(0, 0.1,1)[0] )
    
    x_data[numeric_cols] = x_data[numeric_cols].fillna(0)
    
    
    x_data[categorical_cols] = x_data[categorical_cols].fillna('missing')
    x_data[categorical_cols] = x_data[categorical_cols].astype('string')
    x_data[categorical_cols] = x_data[categorical_cols].astype('category')
    

    #print(x_data.isnull().sum())
    
    x_data_only_cats = one_hot_transformer.transform(x_data)
    data_hot_encoded = pd.DataFrame(x_data_only_cats, index=x_data.index)
    #Extract only the columns that didnt need to be encoded
    x_data = x_data.drop(columns=categorical_cols)
    #Concatenate the two dataframes : 
    x_data = pd.concat([data_hot_encoded, x_data], axis=1)
    x_data = x_data.to_numpy()
    y_data = y_data.to_numpy()
    return x_data, y_data

In [5]:
SF_Train = FixPandas_DF(SF_Train)
category_column = ['x3', 'x24','x31','x33','x59','x60','x65','x77','x79','x93','x98','x99','Part_Of_Week','Season']  
numeric_cols = SF_Train.columns
x_values_to_scale= [x for x in numeric_cols if x not in category_column and x != 'y']



## Up Sample dataset Where Y = 1

Since this is a fairly imbalanced dataset, we need to upsample where Y == 1 so our Model can more accurately identify this class. 

In [6]:
y_values = SF_Train[SF_Train['y'] == 1]
SF_Train = pd.concat([SF_Train, y_values], ignore_index=True, sort=False)
SF_Train = pd.concat([SF_Train, y_values], ignore_index=True, sort=False)
SF_Train = pd.concat([SF_Train, y_values], ignore_index=True, sort=False)

## Data Transformations

1. First I find the means of all training columns to fill in Null values

2. create a scalar based on the training data

3. Create a OneHotEncoder to change all categorical columns to numeric

4. ONLY TRAINING SET: Added Gausssian noise to numeric columns

4. Transform All Datasets based on previous steps

In [7]:
numeric_scaler = setScaler(SF_Train.loc[:, SF_Train.columns != 'y'], x_values_to_scale)
one_hot_encoder = setOneHotEncoder(SF_Train.loc[:, SF_Train.columns != 'y'], category_column)

X_train, X_test, y_train, y_test = train_test_split(
    SF_Train.loc[:, SF_Train.columns != 'y'], SF_Train['y'], test_size=0.33, random_state=42)


SF_Train_column_means = X_train[x_values_to_scale].mean()
SF_Train_column_means = pd.DataFrame(SF_Train_column_means)
SF_Train_column_means = SF_Train_column_means.T
numeric_scaler = setScaler(X_train.loc[:, X_train.columns != 'y'], x_values_to_scale)
one_hot_encoder = setOneHotEncoder(X_train.loc[:, X_train.columns != 'y'], category_column)

X_test, X_val, y_test, y_val = train_test_split(
    X_train, y_train, test_size=0.5, random_state=42)

X_train, y_train = TransformData(X_train, y_train, numeric_scaler, x_values_to_scale, category_column, SF_Train_column_means, one_hot_encoder, "Yes")

X_test, y_test = TransformData(X_test, y_test, numeric_scaler, x_values_to_scale, category_column, SF_Train_column_means, one_hot_encoder, "No")


X_val, y_val = TransformData(X_val, y_val, numeric_scaler, x_values_to_scale, category_column, SF_Train_column_means, one_hot_encoder, "No")

## Metrics I Will Use To Evaluate My Model

In [8]:
def recall_m(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    recall = true_positives / (possible_positives + K.epsilon())
    return recall

def precision_m(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    precision = true_positives / (predicted_positives + K.epsilon())
    return precision

def f1_m(y_true, y_pred):
    precision = precision_m(y_true, y_pred)
    recall = recall_m(y_true, y_pred)
    return 2*((precision*recall)/(precision+recall+K.epsilon()))


## Finding The Best Hyper Parameters Through Bayesian Optimization

I will test the activation function and Learning Rate for my Neural Network

Previous tests concluded that Dropout Rate should be 0 and activations elu and selu did not perform well

In [9]:
def build(hp):
    activation = hp.Choice('activation',
                           values=['relu','swish','tanh'],
                          ordered=False)
           
#    D_out = hp.Float(
#                       'D_out', 
#                       min_value=0.0,
#                       max_value=0.60,
#                       default=0.2)    
    learn_r =  hp.Float(
        'learn_r',
        min_value=1e-7,
        max_value=1e-4,
        default=1e-6
            )    
    model = keras.Sequential()
    model.add(Dense(300, activation=activation, input_shape=(X_train.shape[1],)))
    #model.add(Dropout(D_out))
    model.add(Dense(200,  activation=activation))
    model.add(Dense(100, activation=activation))
    model.add(Dense(50, activation=activation))
    model.add(Dense(25, activation=activation))
    model.add(Dense(13, activation=activation))
    model.add(Dense(5, activation=activation))
    model.add(Dense(1,  activation='sigmoid'))   
    model.compile(
          optimizer=keras.optimizers.Adam(
     learning_rate = learn_r

        ),
            loss='binary_crossentropy',
                  metrics=['accuracy',f1_m,precision_m, recall_m, tf.keras.metrics.AUC()],
        )
    return model

In [10]:
bayesian_opt_tuner = keras_tuner.BayesianOptimization(
    build,
    objective=keras_tuner.Objective("val_accuracy", direction="max"),
    max_trials=10,
    executions_per_trial=1,
    directory=os.path.normpath('C:\\Users\\Grant\\Desktop\\Data_Science\\StateFarm\\'),
    project_name='STATEFARM' ,
    overwrite=True)
n_epochs=200

bayesian_opt_tuner.search(X_train,y_train,epochs=n_epochs,  callbacks=[tf.keras.callbacks.EarlyStopping(monitor='val_accuracy',  
               patience=8), tf.keras.callbacks.ReduceLROnPlateau(monitor='val_accuracy', factor=0.2,
                              patience=7, min_lr=0.0001)],
     validation_data=(X_val, y_val),verbose=1)


bayes_opt_model_best_model = bayesian_opt_tuner.get_best_models(num_models=1)
model = bayes_opt_model_best_model[0]


Search: Running Trial #1

Value             |Best Value So Far |Hyperparameter
relu              |?                 |activation
6.1055e-05        |?                 |learn_r

Epoch 1/200

KeyboardInterrupt: 

In [17]:
model.evaluate(X_test,y_test)



[0.009998246096074581,
 0.9965682029724121,
 0.9954987168312073,
 0.9965898394584656,
 0.9947759509086609,
 0.9998933672904968]

## Neural Network Results

The DNN shows to be very accurate showing 0.9999 AUC.

In [18]:
def TransformFinalData(x_data, numeric_scaler, numeric_cols, categorical_cols, column_means, one_hot_transformer):
    np.random.seed(42)
    
    for col in column_means.columns:
        x_data[col] = x_data[col].fillna(value = column_means[col][0])
    
    x_data[numeric_cols] = numeric_scaler.transform(x_data[numeric_cols])
    
    x_data[numeric_cols] = x_data[numeric_cols].fillna(0)
    
    
    x_data[categorical_cols] = x_data[categorical_cols].fillna('missing')
    x_data[categorical_cols] = x_data[categorical_cols].astype('string')
    x_data[categorical_cols] = x_data[categorical_cols].astype('category')
    

    #print(x_data.isnull().sum())
    
    x_data_only_cats = one_hot_transformer.transform(x_data)
    data_hot_encoded = pd.DataFrame(x_data_only_cats, index=x_data.index)
    #Extract only the columns that didnt need to be encoded
    x_data = x_data.drop(columns=categorical_cols)
    #Concatenate the two dataframes : 
    x_data = pd.concat([data_hot_encoded, x_data], axis=1)
    x_data = x_data.to_numpy()
    return x_data

In [19]:
SF_TEST = pd.read_csv("C:\\Users\\Grant\\Desktop\\Data_Science\\StateFarm\\exercise_40_test.csv")
SF_TEST = FixPandas_DF(SF_TEST)
SF_TEST = TransformFinalData(SF_TEST, numeric_scaler, x_values_to_scale, category_column, SF_Train_column_means, one_hot_encoder)


In [21]:
result = model.predict(SF_TEST)



In [33]:
result_df = pd.DataFrame(result)

In [36]:
result_df.to_csv('nonglmresults.csv', index=False, header=False)