# Nextbit Test

### Data description
`deaths.csv` contains US data, for 1997-2002, from police-reported car crashes in which there is a harmful event (people or property), and from which at least one vehicle was towed. Data are restricted to front-seat occupants, include only a subset of the variables recorded, and are restricted in other ways also.

**Format:**
A data frame with 26217 observations on the following 15 variables.

- `dvcat`: ordered factor with levels (estimated impact speeds) 1-9km/h, 10-24, 25-39, 40-54, 55+
- `weight`:
observation weights, albeit of uncertain accuracy, designed to account for varying sampling probabilities.
- `dead`:
factor with levels alive dead
- `airbag`:
a factor with levels none airbag
- `seatbelt`:
a factor with levels none belted
- `frontal`:
a numeric vector; 0 = non-frontal, 1=frontal impact
- `sex`:
a factor with levels f m
- `ageOFocc`:
age of occupant in years
- `yearacc`:
year of accident
- `yearVeh`:
Year of model of vehicle; a numeric vector
- `abcat`:
did one or more (driver or passenger) airbag(s) deploy? This factor has levels deploy nodeploy unavail
- `occRole`:
a factor with levels driver pass
- `deploy`:
a numeric vector: 0 if an airbag was unavailable or did not deploy; 1 if one or more bags deployed.
- `injSeverity`:
a numeric vector; 0:none, 1:possible injury, 2:no incapacity, 3:incapacity, 4:killed; 5:unknown, 6:prior death
- `caseid`:
character, created by pasting together the populations sampling unit, the case number, and the vehicle number. Within each year, use this to uniquely identify the vehicle.

### Exercises
- **E1.** Develop different models to predict the variable `dead` (alive or dead). Explain your choices.
- **E2.** Select the best model and explain your choice.
- **E3.** *(optional)* Train a neural network to predict the variable `dead`. Explain your choices. Did you achieve a better performance with respect to the model selected in **E2.**? Why?

### Classifier: Neural Network

In [1]:
import numpy as np
import pandas as pd
from sklearn.utils import resample, shuffle

np.random.seed(123)

In [2]:
X = pd.read_csv("deaths.csv") # read the csv
X.drop(['Unnamed: 0'], axis=1, inplace=True)
X.head() # visualize data frame head

Unnamed: 0,dvcat,weight,dead,airbag,seatbelt,frontal,sex,ageOFocc,yearacc,yearVeh,abcat,occRole,deploy,injSeverity,caseid
0,25-39,25.069,alive,none,belted,1,f,26,1997,1990.0,unavail,driver,0,3.0,2:3:1
1,10-24,25.069,alive,airbag,belted,1,f,72,1997,1995.0,deploy,driver,1,1.0,2:3:2
2,10-24,32.379,alive,none,none,1,f,69,1997,1988.0,unavail,driver,0,4.0,2:5:1
3,25-39,495.444,alive,airbag,belted,1,f,53,1997,1995.0,deploy,driver,1,1.0,2:10:1
4,25-39,25.069,alive,none,belted,1,f,32,1997,1988.0,unavail,driver,0,3.0,2:11:1


In [3]:
X.isnull().sum() # check for missing values

# injSeverity and yearVeh have missing values

dvcat            0
weight           0
dead             0
airbag           0
seatbelt         0
frontal          0
sex              0
ageOFocc         0
yearacc          0
yearVeh          1
abcat            0
occRole          0
deploy           0
injSeverity    153
caseid           0
dtype: int64

In [4]:
X['dead'].value_counts() # the target class is imbalanced.

alive    25037
dead      1180
Name: dead, dtype: int64

### Imputing missing values
I chose to impute the missing values of `injSeverity` rather than drop them, because a considerable number of rows (153) are affected.

In [5]:
# impute missing values with 5.0, the code for 'unknown' in the data description
X.loc[X["injSeverity"].isnull(), "injSeverity"] = 5.0

In [6]:
# drop row(s) containing missing values. One missing value
# for the variable 'yearVeh' is dropped
X.dropna(how='any', inplace=True)

### Label Encoding
Label encoding is needed because the classifier cannot handle string values.

In [7]:
# performing label encoding for columns with string values

# dead column
X['dead'] = X.dead.map({'alive': 1, 'dead': 0})

# airbag column
X['airbag'] = X.airbag.map({'airbag': 1, 'none': 0})

# seatbelt column
X['seatbelt'] = X.seatbelt.map({'belted': 1, 'none': 0})

# sex column
X['sex'] = X.sex.map({'m': 1, 'f': 0})

# abcat column
X['abcat'] = X.abcat.map({'unavail': 2, 'deploy': 1, 'nodeploy': 0})

# occRole column
X['occRole'] = X.occRole.map({'driver': 1, 'pass': 0})

# dvcat column
X['dvcat'] = X.dvcat.map({'1-9km/h': 0, '10-24': 1, '25-39': 2, '40-54': 3, '55+': 4})

In [8]:
X.head() # visualize data after encoding

Unnamed: 0,dvcat,weight,dead,airbag,seatbelt,frontal,sex,ageOFocc,yearacc,yearVeh,abcat,occRole,deploy,injSeverity,caseid
0,2,25.069,1,0,1,1,0,26,1997,1990.0,2,1,0,3.0,2:3:1
1,1,25.069,1,1,1,1,0,72,1997,1995.0,1,1,1,1.0,2:3:2
2,1,32.379,1,0,0,1,0,69,1997,1988.0,2,1,0,4.0,2:5:1
3,2,495.444,1,1,1,1,0,53,1997,1995.0,1,1,1,1.0,2:10:1
4,2,25.069,1,0,1,1,0,32,1997,1988.0,2,1,0,3.0,2:11:1
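
One thing worth checking after this kind of encoding: `Series.map` with a dict returns `NaN` for any value that is not a key of the dict, so a misspelt or unexpected label silently becomes a missing value. A minimal sketch on a toy column (the label `'unknown'` is hypothetical and does not occur in `deaths.csv`):

```python
import pandas as pd

# Toy column; 'unknown' is a hypothetical label not present in the mapping.
s = pd.Series(["belted", "none", "unknown", "belted"])
mapping = {"belted": 1, "none": 0}

encoded = s.map(mapping)

# Values absent from the mapping become NaN, so count them explicitly.
n_unmapped = int(encoded.isnull().sum())
print(encoded.tolist())
print("unmapped values:", n_unmapped)
```

Running the same `isnull().sum()` check on the encoded data frame would confirm that every label was covered by the mappings.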


In [9]:
Y = X.pop('dead') # target values

In [10]:
# divide the dataset into train and test set.
# train set will be used to build the model
# test set will be used to evaluate the model

# stratified split is used to preserve the ratio between the majority
# and minority class
from sklearn.model_selection import StratifiedShuffleSplit

ss = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=2)
for train_index, test_index in ss.split(X, Y):
    #print("TRAIN:", train_index, "TEST:", test_index)
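    # ss.split yields positional indices, so select rows with .iloc
    # (.loc would look them up as labels, which no longer match after dropna)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = Y.iloc[train_index], Y.iloc[test_index]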
    

X_train = pd.concat([X_train, y_train], axis=1)

# drop the unique identifier 'caseid'
X_train.drop(['caseid'], axis=1, inplace=True)

X_train.dropna(how='any', inplace=True)


In [11]:
X_train.head()

Unnamed: 0,dvcat,weight,airbag,seatbelt,frontal,sex,ageOFocc,yearacc,yearVeh,abcat,occRole,deploy,injSeverity,dead
1772,1.0,763.016,1.0,1.0,0.0,0.0,31.0,1997.0,1992.0,0.0,1.0,0.0,0.0,1.0
15861,1.0,10.579,0.0,1.0,1.0,1.0,27.0,2000.0,1991.0,2.0,0.0,0.0,1.0,1.0
13220,1.0,725.853,1.0,1.0,0.0,0.0,30.0,2000.0,1996.0,0.0,0.0,0.0,1.0,1.0
2349,0.0,97.905,1.0,1.0,0.0,0.0,28.0,1997.0,1996.0,0.0,0.0,0.0,0.0,1.0
23334,4.0,649.395,1.0,1.0,1.0,1.0,22.0,2002.0,2001.0,1.0,1.0,1.0,1.0,1.0


### SMOTE: Synthetic Minority Oversampling TEchnique

Since the dataset is highly imbalanced, a neural net trained on it directly tends to predict the majority class for everything. The minority class is therefore oversampled with SMOTE, using the DMwR package in R, so I export the training set to be used in R.

A Python implementation of SMOTE also exists, but I ran into problems while using it.
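
To make the idea concrete, here is a minimal NumPy sketch of the interpolation step behind SMOTE. It is not the DMwR implementation: the function name `smote_sketch` and its parameters are made up for illustration, all features are treated as numeric, and no majority undersampling is done.

```python
import numpy as np

def smote_sketch(X_min, n_new_per_sample, k=5, seed=0):
    """Generate synthetic minority rows by interpolating towards neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)

    # Pairwise squared Euclidean distances between minority rows.
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dist = (diffs ** 2).sum(axis=2)
    np.fill_diagonal(dist, np.inf)  # a row is not its own neighbour

    # Indices of the k nearest minority neighbours of every row.
    neighbours = np.argsort(dist, axis=1)[:, :k]

    synthetic = []
    for i in range(n):
        for _ in range(n_new_per_sample):
            j = neighbours[i, rng.integers(k)]
            gap = rng.random()  # position along the segment, in [0, 1)
            synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# Toy minority set: the four corners of the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_rows = smote_sketch(X_min, n_new_per_sample=20, k=2)
print(new_rows.shape)  # 4 rows * 20 synthetic rows each
```

Each synthetic row lies on the line segment between a minority row and one of its nearest minority neighbours, which is why the new points stay inside the region already occupied by the minority class.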

In [None]:
# export the data frames to csv
train_path = "/home/subhankar/Documents/sklearn/nextbit-test/train_data_new.csv"
X_train.to_csv(path_or_buf=train_path)

### R code:

    library(DMwR)
    dataset <- read.csv("train_data_new.csv")
    dataset$dead <- as.factor(dataset$dead)
    newdata <- SMOTE(dataset$dead~., dataset, perc.over = 2000, perc.under = 100)
    write.csv(newdata, 'smoted_train_data.csv')

In [12]:
X_train = pd.read_csv("smoted_train_data.csv")
X_train = shuffle(X_train)
y_train = X_train.pop('dead')
X_train.drop(labels=['Unnamed: 0', 'X'], axis=1, inplace=True)

In [22]:
y_train.value_counts() # now the SMOTE-ed dataset has almost equal representations for each class

0    17220
1    16400
Name: dead, dtype: int64
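
These counts are consistent with how `perc.over` and `perc.under` are defined in DMwR's `SMOTE`. The arithmetic below assumes 820 minority rows in the training split, which is what the 1180 `dead` rows in total minus the 360 in the test set gives; the notebook does not print that number directly.

```python
# Minority ('dead' = 0) rows in the training split: 1180 in total minus the
# 360 that end up in the test set (both figures come from the notebook output).
n_minority = 1180 - 360

perc_over = 2000   # 2000% -> 20 synthetic rows per original minority row
perc_under = 100   # 100%  -> 1 majority row kept per synthetic row

n_synthetic = n_minority * perc_over // 100
n_minority_final = n_minority + n_synthetic
n_majority_final = n_synthetic * perc_under // 100

print(n_minority_final, n_majority_final)
```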

### Neural Network

In [14]:
import tensorflow as tf

In [15]:
# perform one hot encoding
y_one_hot_train = np.eye(2)[y_train]

# convert the dataframe into a numpy array
# (.values replaces as_matrix(), which was removed in pandas 1.0)
X_train = X_train.values

test_id = X_test.pop('caseid') # drop unwanted column
# perform one hot encoding
y_one_hot_test = np.eye(2)[y_test]

# convert the dataframe into a numpy array
# (.values replaces as_matrix(), which was removed in pandas 1.0)
X_test = X_test.values
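
The `np.eye(2)[...]` idiom used for one-hot encoding may look cryptic, so here is a small standalone illustration on made-up labels:

```python
import numpy as np

labels = np.array([1, 0, 1, 1])   # toy labels: 1 = alive, 0 = dead

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing with the label array selects one row per sample.
one_hot = np.eye(2)[labels]
print(one_hot)

# argmax along axis 1 recovers the original integer labels.
recovered = one_hot.argmax(axis=1)
print(recovered)
```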

In [16]:
# declare weights and biases
w1_initial = np.random.normal(size=(13, 5)).astype(np.float32)
w2_initial = np.random.normal(size=(5, 2)).astype(np.float32)

b1_initial = np.random.normal(size=(5)).astype(np.float32)
b2_initial = np.random.normal(size=(2)).astype(np.float32)

In [17]:
def _create_network():
    
    # layer 1
    with tf.variable_scope('fc_1'):
        w1 = tf.Variable(w1_initial)
        b1 = tf.Variable(b1_initial)
        fc1 = tf.add(tf.matmul(x, w1), b1)
        fc1 = tf.nn.sigmoid(fc1)
    
    # layer 2
    with tf.variable_scope('fc_2'):
        w2 = tf.Variable(w2_initial)
        b2 = tf.Variable(b2_initial)
        fc2 = tf.add(tf.matmul(fc1, w2), b2)
        #fc2 = tf.nn.sigmoid(fc2)
    pred = fc2
    
    # calculate cost
    
    #weights = tf.reduce_sum(class_weights * y, axis=1)
    #unweighted_losses = tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=pred)
    #weighted_losses = unweighted_losses * weights
    #cost = tf.reduce_mean(weighted_losses)
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y))
    #cost = tf.reduce_mean(tf.nn.weighted_cross_entropy_with_logits(logits=pred, targets=y, pos_weight=0.04))
    optimizer = tf.train.AdamOptimizer(learning_rate=0.001).minimize(cost, global_step=global_step)
    
    # evaluate model
    y_prediction = tf.argmax(pred, 1)
    correct_pred = tf.equal(y_prediction, tf.argmax(y, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
    
    return cost, optimizer, accuracy, y_prediction
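
For readers who want to see what the graph computes without TensorFlow, the forward pass and cost can be sketched in plain NumPy. This is an illustration of the same 13-5-2 architecture, not the notebook's code: the weights and the toy batch are random, and no training step is included.

```python
import numpy as np

rng = np.random.default_rng(123)

# Same shapes as the notebook: 13 inputs -> 5 hidden units -> 2 logits.
w1, b1 = rng.normal(size=(13, 5)), rng.normal(size=5)
w2, b2 = rng.normal(size=(5, 2)), rng.normal(size=2)

def forward(x):
    hidden = 1.0 / (1.0 + np.exp(-(x @ w1 + b1)))  # sigmoid activation
    return hidden @ w2 + b2                        # raw logits

def softmax_cross_entropy(logits, one_hot):
    # Subtract the row maximum for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(one_hot * log_probs).sum(axis=1).mean()

x_batch = rng.normal(size=(4, 13))           # hypothetical toy batch
y_batch = np.eye(2)[np.array([1, 0, 1, 1])]  # one-hot toy labels
logits = forward(x_batch)
cost = softmax_cross_entropy(logits, y_batch)
predictions = logits.argmax(axis=1)
print(logits.shape, float(cost))
```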

In [18]:
tf.reset_default_graph()
#class_weights = tf.constant([[1.5, 1.0]])

# declare placeholders
x = tf.placeholder(tf.float32, [None, 13])
y = tf.placeholder(tf.float32, [None, 2])

global_step = tf.Variable(initial_value=0, name='global_step', trainable=False)

cost, optimizer, accuracy, y_prediction = _create_network()

epochs = 100
num_train = len(X_train)
batch_size = 50

# train the network

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    
    for _iter in range(epochs):
        
        begin = 0
        while begin < num_train:
            end = min(begin + batch_size, num_train)     
            train_dict = {x:X_train[begin:end,], y:y_one_hot_train[begin:end]}
            begin = end
            
            i_global, _, c, acc = sess.run([global_step, optimizer, cost, accuracy], feed_dict=train_dict)
            
            if i_global % 1000 == 0:
                print("Epoch: {0}. Iteration: {1}. Mini-batch cost: {2:.4f}. Train accuracy:{3:.4f}".format(_iter,
                                                                                                    i_global,
                                                                                                    c,
                                                                                                    acc))
    print("Optimization Done")
    
    test_dict = {x:X_test, y:y_one_hot_test}
    acc_test, y_pred = sess.run([accuracy, y_prediction], feed_dict=test_dict)
    print("Testing accuracy: {0:.2f}".format(acc_test*100))

Epoch: 1. Iteration: 1000. Mini-batch cost: 0.6948. Train accuracy:0.4800
Epoch: 2. Iteration: 2000. Mini-batch cost: 0.6427. Train accuracy:0.5600
Epoch: 4. Iteration: 3000. Mini-batch cost: 0.5448. Train accuracy:0.7000
Epoch: 5. Iteration: 4000. Mini-batch cost: 0.6279. Train accuracy:0.6800
Epoch: 7. Iteration: 5000. Mini-batch cost: 0.6204. Train accuracy:0.6400
Epoch: 8. Iteration: 6000. Mini-batch cost: 0.6744. Train accuracy:0.6000
Epoch: 10. Iteration: 7000. Mini-batch cost: 0.6987. Train accuracy:0.5200
Epoch: 11. Iteration: 8000. Mini-batch cost: 0.4325. Train accuracy:0.8600
Epoch: 13. Iteration: 9000. Mini-batch cost: 0.5666. Train accuracy:0.7400
Epoch: 14. Iteration: 10000. Mini-batch cost: 0.5314. Train accuracy:0.7400
Epoch: 16. Iteration: 11000. Mini-batch cost: 0.7320. Train accuracy:0.5800
Epoch: 17. Iteration: 12000. Mini-batch cost: 0.4997. Train accuracy:0.7800
Epoch: 19. Iteration: 13000. Mini-batch cost: 0.6046. Train accuracy:0.6800
Epoch: 20. Iteration: 14000

In [19]:
from sklearn.metrics import f1_score, confusion_matrix, roc_auc_score
# micro-averaged F1 (identical to accuracy for single-label classification)
print("F1-score: {0}".format(f1_score(y_test, y_pred, average='micro')))

F1-score: 0.892180546726


In [20]:
# print the confusion matrix
conf_mat = confusion_matrix(y_test, y_pred) # confusion matrix
conf_mat

array([[ 356,    4],
       [ 844, 6661]])
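
Since a micro-averaged F1 on a single-label problem is the same number as accuracy, it helps to look at the minority class on its own. The figures below are derived only from the confusion matrix printed above (rows are true classes, columns are predictions, with 0 = dead and 1 = alive):

```python
# Confusion matrix copied from the output above.
# Rows are true classes, columns are predicted classes; 0 = dead, 1 = alive.
conf_mat = [[356, 4],
            [844, 6661]]

tp = conf_mat[0][0]   # dead correctly predicted as dead
fn = conf_mat[0][1]   # dead predicted as alive
fp = conf_mat[1][0]   # alive predicted as dead
tn = conf_mat[1][1]   # alive correctly predicted as alive

accuracy = (tp + tn) / (tp + fn + fp + tn)
precision_dead = tp / (tp + fp)
recall_dead = tp / (tp + fn)
f1_dead = 2 * precision_dead * recall_dead / (precision_dead + recall_dead)

print(f"accuracy (= micro F1): {accuracy:.4f}")
print(f"precision (dead):      {precision_dead:.4f}")
print(f"recall (dead):         {recall_dead:.4f}")
print(f"F1 (dead):             {f1_dead:.4f}")
```

Recall on the `dead` class is very high, but precision is low: most observations predicted as `dead` are in fact `alive`.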

In [21]:
y_test.value_counts() # true class counts of the test set

1    7505
0     360
Name: dead, dtype: int64

### Discussion

Training a neural net on a dataset with very few features is quite tricky. I adopted the following techniques before I obtained a decent result on the test set:

1. I trained the net on the original imbalanced dataset and it gave a very high accuracy. This is misleading, since the net classifies everything as the majority class to incur less error.
2. I tried training the model with a penalised cross-entropy loss that puts more weight on misclassifying the minority class than the majority class. However, the classifier then assigned everything to the minority class, and I was unable to find suitable weights even with a grid search.
3. Finally, I tried SMOTE, which generates synthetic examples of the minority class. With this expanded dataset containing almost equal representation of both classes, the model was able to find a better classification boundary.
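
The penalised loss from step 2 can be sketched in NumPy as follows. It mirrors the commented-out `class_weights` lines in the network definition; the weights `[1.5, 1.0]` come from the commented-out constant, while the function name and the toy logits are made up for illustration.

```python
import numpy as np

def weighted_softmax_cross_entropy(logits, one_hot, class_weights):
    """Mean cross-entropy where each sample is scaled by its class weight."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    per_sample_loss = -(one_hot * log_probs).sum(axis=1)
    # Pick the weight that belongs to each sample's true class.
    sample_weights = (one_hot * class_weights).sum(axis=1)
    return (sample_weights * per_sample_loss).mean()

logits = np.array([[2.0, 0.5], [0.2, 1.5], [1.0, 1.0]])  # toy values
one_hot = np.eye(2)[np.array([0, 1, 0])]                 # true classes 0, 1, 0

plain = weighted_softmax_cross_entropy(logits, one_hot, np.array([1.0, 1.0]))
weighted = weighted_softmax_cross_entropy(logits, one_hot, np.array([1.5, 1.0]))
print(plain, weighted)
```

Raising the weight of class 0 increases the cost of every misclassified class-0 sample, which pushes the decision boundary towards predicting class 0 more often. If the weight is too large, this produces exactly the "everything is minority" behaviour described in step 2.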

As the confusion matrix shows, out of 360 <b>dead</b> observations the net correctly classifies <b>356</b>, but it wrongly classifies 844 of the 7505 <b>alive</b> subjects as <b>dead</b>, which is more than the other classifiers do.

Hence, in my case, the neural net did not outperform the Random Forest classifier. A possible reason is that the neural net learns a complicated decision boundary that does not generalize well to the test data.