# Classification with TensorFlow/Keras

In this notebook we will use a simple DNN for classification. We will use the well-known Titanic dataset to predict passenger survival.

Since you've already worked with the dataset, we'll skip the exploration part and do just a quick and dirty cleanup before we start modelling.

## Setup

We'll start as always with importing the necessary libraries and the dataset.

In [None]:
# Import all the libraries I need
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# ignore Deprecation Warning
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.preprocessing import StandardScaler

import tensorflow as tf
from tensorflow.keras.utils import plot_model

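# Use the Keras API bundled with TensorFlow so that all imports come from the same package
from tensorflow import keras
from tensorflow.keras.models import Sequential  # initialize the ANN
from tensorflow.keras.layers import Dense, Activation, Dropout  # create layers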

np.random.seed(42)
tf.random.set_seed(42)


# Load dataset
df = pd.read_csv('../data/titanic.csv')



## Data Cleaning

Looking at the information about each column and the missing values shows us that we have to clean our data before we can use it for modelling.

In [None]:
df.head()

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
df.isnull().sum()

Replace the 2 missing values in `Embarked` with the most common class, `S`.

In [None]:
df.Embarked.isnull().sum(axis=0)

In [None]:
df.describe(include=['O']) # S is the most common

In [None]:
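# Fill the NaNs; assigning the result back avoids the chained in-place fillna,
# which is deprecated in recent pandas versions
df['Embarked'] = df['Embarked'].fillna('S')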

Replace missing ages with the median age of each `Pclass`.

In [None]:
df.groupby('Pclass').Age.median()

In [None]:
# Define function for replacing missing age values based on passenger class 
def age_approx(cols):
    # Access by label; positional access such as cols[0] is deprecated for labelled Series
    Age = cols['Age']
    Pclass = cols['Pclass']
    
    if pd.isnull(Age):
        if Pclass == 1:
            return 37
        elif Pclass == 2:
            return 29
        else:
            return 24
    else:
        return Age
    
    
# Replace missing age values with previously defined function 
df['Age'] = df[['Age', 'Pclass']].apply(age_approx, axis=1)
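The same idea can be written without hard-coded medians. The following is only a sketch on a made-up toy frame (the `toy` data and its values are assumptions, not taken from the Titanic file): it computes the per-class median with `groupby(...).transform` and fills only the missing ages.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic columns (values are invented for illustration)
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "Age": [40.0, 34.0, np.nan, 29.0, np.nan, 20.0, 28.0, np.nan],
})

# Median age per class, broadcast back to every row of that class (NaNs are skipped)
class_median = toy.groupby("Pclass")["Age"].transform("median")

# Fill only the missing ages; observed ages stay untouched
toy["Age"] = toy["Age"].fillna(class_median)
print(toy)
```

Compared with the hand-written function, this variant recomputes the medians from whatever data it is given, so it stays correct if the dataset changes.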

Drop `Cabin` because most of its values are missing.

In [None]:
# Let's drop cabin since the majority of entries are missing 
df.drop('Cabin', axis=1, inplace=True)

# As a safety net, drop any rows that still contain NaNs (Embarked was already filled above)
df.dropna(inplace=True)

# Now our dataset is clean (enough)
df.isnull().sum()

## Preparation of Features for Model

First we define our features and our label. Then we preprocess our data based on the type (categorical or numeric). 

In [None]:
# Define features and label 
y = df['Survived']
X = df.loc[:,['Age', 'Fare', 'Sex', 'SibSp', 'Parch', 'Pclass', 'Embarked']]

In [None]:
# Let's define our numerical and categorical features
cat_columns = ['Sex', 'SibSp', 'Parch', 'Pclass', 'Embarked']
num_columns = ['Age', 'Fare']

In [None]:
X = pd.get_dummies(X, columns=cat_columns, drop_first=True, dtype='uint8')
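To see what `drop_first=True` does, here is a small sketch on a hypothetical three-row frame (the `toy_cat` data is an assumption for illustration). For each categorical column, the first category in sorted order is dropped, because it is implied when all remaining dummy columns are 0.

```python
import pandas as pd

# Hypothetical mini example with two categorical columns
toy_cat = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

encoded = pd.get_dummies(toy_cat, columns=["Sex", "Embarked"],
                         drop_first=True, dtype="uint8")

# 'female' and 'C' are the dropped reference categories
print(encoded.columns.tolist())
print(encoded)
```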

Now we split our data into train and test sets.

In [None]:
# Split into train and test set 
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state = 42)
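The `stratify=y` argument keeps the class ratio the same in both splits. A minimal sketch with made-up labels (`X_toy` and `y_toy` are assumptions, not the Titanic data) illustrates this:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60 zeros and 40 ones
y_toy = np.array([0] * 60 + [1] * 40)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, test_size=0.3, random_state=42
)

# Both splits keep the 40% share of the positive class
print(len(y_tr), y_tr.mean())
print(len(y_te), y_te.mean())
```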

In [None]:
# Fit the scaler on the training data only, then reuse its statistics for the test data
scaler = StandardScaler()
X_train[num_columns] = scaler.fit_transform(X_train[num_columns])
X_test[num_columns] = scaler.transform(X_test[num_columns])
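The scaler is fitted on the training data only, so no information from the test set leaks into preprocessing. The sketch below uses invented numbers (`toy_train`, `toy_test` and `toy_scaler` are hypothetical names) to show that the test values are transformed with the training mean and standard deviation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

toy_train = np.array([[10.0], [20.0], [30.0]])
toy_test = np.array([[40.0]])

toy_scaler = StandardScaler()
train_scaled = toy_scaler.fit_transform(toy_train)  # learns mean and std from the training data
test_scaled = toy_scaler.transform(toy_test)        # reuses those statistics, no refitting

print(toy_scaler.mean_, toy_scaler.scale_)
print(train_scaled.ravel(), test_scaled.ravel())
```

Because the test value lies outside the training range, its scaled value is larger than any scaled training value, which is expected.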

In [None]:
X_train.shape

Up to this point the steps should look familiar to you. Now we will create our very simple Dense Neural Network with:
- an input layer with 19 nodes (one per feature column after encoding)
- a first hidden layer with 9 nodes
- a second hidden layer with 9 nodes
- a dropout layer, which randomly sets input units to 0 with probability `rate` at each training step to help prevent overfitting. Inputs that are not set to 0 are scaled up by 1/(1 - rate) so that the expected sum over all inputs is unchanged.
- an output layer that predicts a passenger's survival. It uses a sigmoid activation function, which 'squashes' the output to a value between 0 and 1.
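The dropout behaviour described in the list can be sketched in plain NumPy. This is an illustrative re-implementation under the assumption of "inverted dropout" as described above, not the Keras source code; the function name `dropout` and the seed are arbitrary choices.

```python
import numpy as np

def dropout(x, rate, rng):
    """Zero each unit with probability `rate` and rescale the survivors by 1 / (1 - rate)."""
    keep_mask = rng.random(x.shape) >= rate
    return x * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
x = np.ones(100_000)
dropped = dropout(x, rate=0.2, rng=rng)

# Surviving units are scaled to 1.25, so the mean stays close to 1 on average
print(np.unique(dropped))
print(dropped.mean())
```

Note that this rescaling only applies during training; at prediction time a dropout layer passes its inputs through unchanged.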

We then compile our NN. This is where we choose hyperparameters such as:

1. Optimizers:
The architecture of a neural network determines what information it can extract from the data, but almost all networks are trained with update rules based on the gradient of the loss function.
The optimizer determines these update rules, and performance and convergence speed can vary considerably between optimizers. The gradient tells us the direction of the update, but not how large a step to take. Small steps keep us on track, but it may take a very long time to reach a (local) minimum. Large steps speed up the process, but they may overshoot the minimum.
Adam and RMSProp are two popular optimizers. Both scale each update with an exponentially decaying average of the squared gradient; Adam additionally keeps a decaying average of the gradient itself. There are more, so have a [look](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers).

2. Loss
The loss function measures how far the predictions are from the labels, and it is the quantity the optimizer minimizes. There are different [loss functions](https://www.tensorflow.org/api_docs/python/tf/keras/losses); for our binary problem we use `binary_crossentropy`.

3. Metrics
Metrics are reported during training and evaluation but are not used for optimization. Here we track `accuracy`.

4. And so on. There are many more options. Have a look yourself in the [documentation](https://www.tensorflow.org/api_docs/python/tf/keras).
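To make the optimizer update rules more concrete, here is a rough NumPy sketch of RMSProp and Adam minimizing the toy function f(w) = w². It follows the textbook update rules; the learning rate, step count and starting point are illustrative assumptions, and the code is not how Keras implements these optimizers internally.

```python
import numpy as np

def grad(w):
    return 2.0 * w  # gradient of f(w) = w**2

def rmsprop(w, steps=200, lr=0.01, rho=0.9, eps=1e-7):
    v = 0.0  # decaying average of squared gradients
    for _ in range(steps):
        g = grad(w)
        v = rho * v + (1 - rho) * g**2
        w = w - lr * g / (np.sqrt(v) + eps)
    return w

def adam(w, steps=200, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-7):
    m, v = 0.0, 0.0  # decaying averages of the gradient and the squared gradient
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)  # bias correction for the zero initialization
        v_hat = v / (1 - beta2**t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

w_rms = rmsprop(1.0)
w_adam = adam(1.0)
print(w_rms, w_adam)  # both end up much closer to the minimum at 0 than the start value 1.0
```

In both cases the step size is normalized by the recent gradient magnitude, which is why the learning rate roughly controls how far the weight moves per step.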

In [None]:
# Initialising the NN
model = Sequential()

# layers
model.add(Dense(units = 9, kernel_initializer = 'uniform', activation = 'relu', input_dim = 19))
model.add(Dense(units = 9, kernel_initializer = 'uniform', activation = 'relu'))
model.add(Dropout(0.2))
model.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))

# Compiling the ANN
model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
# summary() prints the table itself and returns None, so we don't wrap it in print()
model.summary()
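The combination of a sigmoid output and the `binary_crossentropy` loss can be written out by hand. The following NumPy sketch uses made-up logits and labels, and the clipping constant `eps` is an assumption added for numerical safety:

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_crossentropy(y_true, p, eps=1e-7):
    """Mean negative log-likelihood of the true labels under predicted probabilities p."""
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

logits = np.array([-2.0, 0.0, 3.0])   # hypothetical raw outputs of the last layer
y_true = np.array([0.0, 1.0, 1.0])    # hypothetical labels

probs = sigmoid(logits)
loss = binary_crossentropy(y_true, probs)
print(probs, loss)
```

Confident correct predictions contribute little to the loss, while the undecided prediction of 0.5 contributes log(2).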

In [None]:
plot_model(
    model, to_file='model.png', show_shapes=True, 
    show_layer_names=True, dpi=96
)
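The layer shapes shown by the summary and the plot translate directly into parameter counts: a dense layer has one weight per input-unit pair plus one bias per unit. A short worked calculation for the 19-9-9-1 architecture above (the helper `dense_params` is a hypothetical name):

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

layer_sizes = [19, 9, 9, 1]  # input features, two hidden layers, output
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]
total = sum(per_layer)

print(per_layer, total)  # [180, 90, 10] 280
```

The dropout layer adds no trainable parameters, so it does not appear in this calculation.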

Now we will train our model.

In [None]:
# Train the ANN
training = model.fit(X_train, y_train, batch_size = 48, validation_split=0.2, epochs = 200)
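It can help to work out how many gradient updates these settings imply. With `validation_split`, Keras holds out the last fraction of the samples passed to `fit` (before shuffling) and trains on the rest. The sketch below assumes a hypothetical 600 training rows; the real number depends on your data, and the helper `steps_per_epoch` is not a Keras function.

```python
import math

def steps_per_epoch(n_samples, validation_split, batch_size):
    """Return (training rows, validation rows, batches per epoch)."""
    n_fit = int(math.floor(n_samples * (1 - validation_split)))
    n_val = n_samples - n_fit
    return n_fit, n_val, math.ceil(n_fit / batch_size)

# 600 is an assumed sample count for illustration only
n_fit, n_val, steps = steps_per_epoch(n_samples=600, validation_split=0.2, batch_size=48)
total_updates = steps * 200  # 200 epochs, as in the fit call above

print(n_fit, n_val, steps, total_updates)
```

A smaller batch size means more updates per epoch, which is one reason why changing it affects both training time and the shape of the learning curve.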

Let's have a look at the accuracy after each epoch on the training and validation sets.

In [None]:
# summarize history for accuracy
plt.plot(training.history['accuracy'])
plt.plot(training.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()
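Besides plotting, the history can be inspected numerically, for example to find the epoch with the best validation accuracy or to measure the gap between training and validation accuracy. The values below are invented; `toy_history` is a hypothetical dictionary with the same keys as `training.history`.

```python
import numpy as np

# Hypothetical history with the same keys as `training.history`
toy_history = {
    "accuracy": [0.62, 0.75, 0.81, 0.84, 0.86],
    "val_accuracy": [0.60, 0.74, 0.80, 0.79, 0.78],
}

best_epoch = int(np.argmax(toy_history["val_accuracy"]))             # 0-based epoch index
final_gap = toy_history["accuracy"][-1] - toy_history["val_accuracy"][-1]

print(best_epoch, round(final_gap, 2))
```

A training accuracy that keeps rising while the validation accuracy stalls or drops, as in this toy example, is a typical sign of overfitting.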

And now we will predict the labels for our test set and look at the confusion matrix and the accuracy score.

In [None]:
y_pred = model.predict(X_test)

# Plotting the confusion matrix
mat = confusion_matrix(y_test, y_pred.round())
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False)
plt.xlabel('true label')
plt.ylabel('predicted label');

In [None]:
# The built-in round() works whether accuracy_score returns a Python float or a NumPy float
round(accuracy_score(y_test, y_pred.round()), 2)
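The step from predicted probabilities to class labels and metrics can be made explicit. This sketch uses invented probabilities in the column shape that `model.predict` returns for a single output unit; the 0.5 threshold is an explicit choice here, and `toy_probs` and `toy_true` are hypothetical names.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

toy_probs = np.array([[0.9], [0.2], [0.5], [0.7], [0.4]])  # shape (n, 1)
toy_true = np.array([1, 0, 1, 0, 0])

# An explicit threshold makes the decision rule visible
toy_labels = (toy_probs.ravel() > 0.5).astype(int)

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(toy_true, toy_labels).ravel()
acc = accuracy_score(toy_true, toy_labels)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(tn, fp, fn, tp, acc, precision, recall)
```

NumPy rounds exact halves to the nearest even number, so `.round()` also maps a probability of exactly 0.5 to class 0, which matches the `> 0.5` rule used here.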

**Exercise:**
- What happens when you add layers to our model?
- What happens when you change the number of nodes?
- What happens when you change the batch size or the optimizer?