---
# <div align="center"><font color='red'> COSC 2779/2972 | Deep Learning  </font></div>
## <div align="center"> <font color='red'> Week 2 Lectorial Example: **Feed-Forward Neural Networks**</font></div>
---

# Activation Function Basics

Based on:

> "7 popular activation functions you should know in Deep Learning and how to use them with Keras and TensorFlow 2" by B. Chen


In [None]:
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

%matplotlib inline

z = np.linspace(-7, 7, 200)

def derivative(f, z, eps=0.000001):
    return (f(z + eps) - f(z - eps))/(2 * eps)

## Sigmoid (Logistic)

The Sigmoid function (also known as the Logistic function) is one of the most widely used activation functions.

In [None]:
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

plt.figure(figsize=(11,4))

plt.subplot(1,2,1)
# coordinate 
plt.plot([-7, 7], [0, 0], 'k-', linewidth=1)
plt.plot([-5, 5], [-1, -1], 'k--', linewidth=1)
plt.plot([0, 0], [-2.2, 3.2], 'k-', linewidth=1)
plt.plot([-7,7], [1,1], 'k--', linewidth=1)
# Plot sigmoid
plt.plot(z, sigmoid(z), "b-", linewidth=2, label="Sigmoid")
plt.grid(True)
plt.title("Sigmoid activation function", fontsize=14)
plt.axis([-7, 7, -0.2, 1.2])

plt.subplot(1,2,2)
plt.plot([-7, 7], [0, 0], 'k-', linewidth=1)
plt.plot([-5, 5], [-1, -1], 'k--', linewidth=1)
plt.plot([0, 0], [-2.2, 3.2], 'k-', linewidth=1)
plt.plot(z, derivative(sigmoid, z), "b-", linewidth=2, label="Sigmoid")
plt.grid(True)
plt.title("Derivative", fontsize=14)
plt.axis([-7, 7, -0.2, 1.2])
plt.show()

- The function is a common S-shaped curve.
- The output of the function is centered at 0.5 with a range from 0 to 1.
- The function is differentiable. That means we can find the slope of the sigmoid curve at any point.
- The function is monotonic but the function’s derivative is not.

The Sigmoid function was introduced to Artificial Neural Networks (ANN) to replace the Step function. It was a key change to ANN architecture because the Step function doesn’t have any gradient to work with Gradient Descent, while the Sigmoid function has a well-defined nonzero derivative everywhere, allowing Gradient Descent to make some progress at every step during training.
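
The derivative of the sigmoid also has a closed form, $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, which peaks at 0.25 when $z = 0$. The following is a minimal, self-contained sketch that checks this closed form against a central-difference estimate like the `derivative` helper above; `sigmoid_grad` and `numerical_derivative` are names introduced here for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_grad(z):
    # Closed-form derivative: sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1 - s)

def numerical_derivative(f, z, eps=1e-6):
    # Central difference, the same scheme as derivative() defined earlier
    return (f(z + eps) - f(z - eps)) / (2 * eps)

z = np.linspace(-7, 7, 200)
max_abs_diff = np.max(np.abs(sigmoid_grad(z) - numerical_derivative(sigmoid, z)))

print("largest gap between analytic and numerical derivative:", max_abs_diff)
print("derivative at z = 0:", sigmoid_grad(0.0))
```

The two estimates should agree to many decimal places, which is why the numerical helper is good enough for plotting.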

**Problems with Sigmoid activation function**

- Vanishing gradient: looking at the function plot, you can see that when inputs become very negative or very positive, the function saturates at 0 or 1, with a derivative extremely close to 0. Thus it has almost no gradient to propagate back through the network, so there is almost nothing left for lower layers.
- Computationally expensive: the function has an exponential operation.
- The output is not zero centered: it is always positive, so the inputs passed to the next layer all share the same sign, which can slow down Gradient Descent.
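
To get a feel for the vanishing gradient, recall that backpropagation multiplies one activation derivative per layer. The sketch below is a simplified illustration, assuming we ignore the weight matrices and look only at the sigmoid derivative factors; even in the best case (every unit at $z = 0$) the product shrinks quickly with depth.

```python
import numpy as np

def sigmoid_grad(z):
    s = 1 / (1 + np.exp(-z))
    return s * (1 - s)

# Best case: every layer sits at z = 0, where the derivative is largest (0.25)
best_case = {depth: 0.25 ** depth for depth in (1, 5, 10)}
for depth, factor in best_case.items():
    print(depth, "layers -> derivative factor of at most", factor)

# A saturated unit contributes far less than the best case
saturated = sigmoid_grad(6.0)
print("derivative at z = 6:", saturated)
```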

**The above problems are of concern when sigmoid is used as the activation for hidden layers. However, sigmoid is still used as the activation for the last layer. Why?**


## Tanh



In [None]:
plt.figure(figsize=(11,4))

plt.subplot(1,2,1)
# coordinate 
plt.plot([-7, 7], [0, 0], 'k-', linewidth=1)
plt.plot([-5, 5], [-1, -1], 'k--', linewidth=1)
plt.plot([0, 0], [-2.2, 3.2], 'k-', linewidth=1)
plt.plot([-7,7], [1,1], 'k--', linewidth=1)
# Plot
plt.plot(z, np.tanh(z), "b-", linewidth=2, label="Tanh")
plt.grid(True)
plt.title("Tanh activation function", fontsize=14)
plt.axis([-5, 5, -1.2, 1.2])

plt.subplot(1,2,2)
# coordinate 
plt.plot([-7, 7], [0, 0], 'k-', linewidth=1)
plt.plot([0, 0], [-2.2, 3.2], 'k-', linewidth=1)
# Plot
plt.plot(z, derivative(np.tanh, z), "b-", linewidth=2, label="Tanh")
plt.grid(True)
plt.title("Derivative", fontsize=14)
plt.axis([-5, 5, -1.2, 1.2])
plt.show()

- The function is a common S-shaped curve as well.
- The difference is that the output of Tanh is zero centered with a range from -1 to 1 (instead of 0 to 1 in the case of the Sigmoid function).
- The same as the Sigmoid, this function is differentiable.
- The same as the Sigmoid, the function is monotonic, but the function’s derivative is not.

Tanh has characteristics similar to Sigmoid that can work with Gradient Descent. One important point to mention is that Tanh tends to make each layer’s output more or less centered around 0 and this often helps speed up convergence.
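
The similarity to Sigmoid is exact: $\tanh(z) = 2\sigma(2z) - 1$, i.e. Tanh is a rescaled and shifted sigmoid. The short sketch below checks this identity numerically and compares the average output of both functions on a grid that is symmetric around 0 (variable names are illustrative).

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.linspace(-7, 7, 200)

# Tanh written as a stretched and shifted sigmoid
tanh_from_sigmoid = 2 * sigmoid(2 * z) - 1
max_gap = np.max(np.abs(np.tanh(z) - tanh_from_sigmoid))
print("max gap between tanh(z) and 2*sigmoid(2z) - 1:", max_gap)

# On a symmetric grid, sigmoid outputs average 0.5 while tanh outputs average 0
sigmoid_mean = sigmoid(z).mean()
tanh_mean = np.tanh(z).mean()
print("mean sigmoid output:", sigmoid_mean)
print("mean tanh output:   ", tanh_mean)
```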

**Problems with Tanh activation function**

- Vanishing gradient: looking at the function plot, you can see that when inputs become very negative or very positive, the function saturates at -1 or 1, with a derivative extremely close to 0. Thus it has almost no gradient to propagate back through the network, so there is almost nothing left for lower layers.
- Computationally expensive: the function has an exponential operation.

## Rectified Linear Unit (ReLU)

The Rectified Linear Unit (ReLU) is the most commonly used activation function in deep learning. The function returns 0 if the input is negative, but for any positive input, it returns that value back.

In [None]:
def relu(z):
    return np.maximum(0, z)

plt.figure(figsize=(11,4))

plt.subplot(1, 2, 1)
# coordinate 
plt.plot([-7, 7], [0, 0], 'k-', linewidth=1)
plt.plot([0, 0], [-2.2, 5], 'k-', linewidth=1)
# Plot
plt.plot(z, relu(z), "b-", linewidth=2, label="ReLU")
plt.grid(True)
plt.title("ReLU activation function", fontsize=14)
plt.axis([-5, 5, -0.2, 5])

plt.subplot(1, 2,2)
# coordinate 
plt.plot([-7, 7], [0, 0], 'k-', linewidth=1)
plt.plot([0, 0], [-2.2, 5], 'k-', linewidth=1)
# Plot
plt.plot(z, derivative(relu, z), "b-", linewidth=2, label="ReLU")
plt.grid(True)
plt.title("Derivative", fontsize=14)
plt.axis([-5, 5, -0.2, 5])
plt.show()

- Graphically, the ReLU function is composed of two linear pieces to account for non-linearities. A function is non-linear if the slope isn’t constant. So, the ReLU function is non-linear around 0, but the slope is always either 0 (for negative inputs) or 1 (for positive inputs).
- The ReLU function is continuous, but it is not differentiable at z = 0, where the slope jumps from 0 (for negative inputs) to 1 (for positive inputs). In practice the derivative at 0 is simply set to 0 or 1.
- The output of ReLU has no maximum value (it does not saturate for positive inputs), and this helps Gradient Descent.
- The function is very fast to compute (compared to Sigmoid and Tanh).

It’s surprising that such a simple function works very well in deep neural networks.

**Problem with ReLU**
ReLU works great in most applications, but it is not perfect. It suffers from a problem known as the dying ReLU.

> *Dying ReLU:
During training, some neurons effectively die, meaning they stop outputting anything other than 0. In some cases, you may find that half of your network’s neurons are dead, especially if you used a large learning rate. A neuron dies when its weights get tweaked in such a way that the weighted sum of its inputs are negative for all instances in the training set. When this happens, it just keeps outputting 0s, and gradient descent does not affect it anymore since the gradient of the ReLU function is 0 when its input is negative.(Hands-on Machine Learning, page 329)*.

## Leaky ReLU
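Leaky ReLU is an improvement over the ReLU activation function. It keeps the main properties of ReLU, but uses a small positive slope (alpha) for negative inputs, so the gradient is never exactly zero and the dying ReLU problem is largely avoided.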
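
To see why the small negative slope matters, the sketch below builds a single "dead" neuron and compares the gradient it would receive under ReLU and under Leaky ReLU. The inputs, weights and bias are made-up values chosen so that the weighted sum is negative for every sample; this is an illustration, not part of the model trained later.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(100, 3))   # non-negative inputs, e.g. after min-max scaling

# Hypothetical weights after an unlucky update: all negative, with a negative bias
w = np.array([-0.5, -1.0, -0.2])
b = -0.1
z = X @ w + b                               # weighted sum is negative for every sample

alpha = 0.05
relu_slope = np.where(z > 0, 1.0, 0.0)      # ReLU derivative at each sample
leaky_slope = np.where(z > 0, 1.0, alpha)   # Leaky ReLU derivative at each sample

# Gradient of the summed activations with respect to w (chain rule: slope * input)
grad_relu = (relu_slope[:, None] * X).sum(axis=0)
grad_leaky = (leaky_slope[:, None] * X).sum(axis=0)

print("ReLU gradient: ", grad_relu)    # all zeros: the weights can no longer change
print("Leaky gradient:", grad_leaky)   # small but non-zero
```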

In [None]:
def leaky_relu(z, alpha=0.05):
    return np.maximum(alpha * z, z)

plt.figure(figsize=(11,4))

plt.subplot(1, 2, 1)
# coordinate 
plt.plot([-7, 7], [0, 0], 'k-', linewidth=1)
plt.plot([0, 0], [-2.2, 5], 'k-', linewidth=1)
# Plot
plt.plot(z, leaky_relu(z, 0.05), 'b-', linewidth=2)
plt.plot([-5,5], [0,0], 'k-')
plt.plot([0,0], [-0.5, 5], 'k-')
plt.grid(True)
plt.title("Leaky ReLU activation function", fontsize=14)
plt.axis([-5,5,-0.5, 2])

plt.subplot(1, 2, 2)
# coordinate 
plt.plot([-7, 7], [0, 0], 'k-', linewidth=1)
plt.plot([0, 0], [-2.2, 5], 'k-', linewidth=1)
# Plot
plt.plot(z, derivative(leaky_relu, z), "b-", linewidth=2, label="Leaky ReLU")
plt.grid(True)
plt.title("Derivative", fontsize=14)
plt.axis([-5,5,-0.05, 1.2])
plt.show()

## Exponential Linear Unit (ELU)
Exponential Linear Unit (ELU) is a variation of ReLU whose output for z < 0 is a smooth curve that levels off at -alpha instead of being cut to 0.

In [None]:
def elu(z, alpha=1):
    return np.where(z < 0, alpha * (np.exp(z) - 1), z)

plt.figure(figsize=(11,4))

plt.subplot(1,2,1)
# coordinate 
plt.plot([-7, 7], [0, 0], 'k-', linewidth=1)
plt.plot([0, 0], [-2.2, 5], 'k-', linewidth=1)
# Plot
plt.plot(z, elu(z), 'b-', linewidth=2)
plt.plot([-5, 5], [0, 0], 'k-')
plt.plot([-5, 5], [-1, -1], 'k--')
plt.plot([0, 0], [-2.2, 3.2], 'k-')
plt.grid(True)
plt.title("ELU activation function", fontsize=14)
plt.axis([-5, 5, -2.2, 3.2])

plt.subplot(1,2,2)
# coordinate 
plt.plot([-7, 7], [0, 0], 'k-', linewidth=1)
plt.plot([0, 0], [-2.2, 5], 'k-', linewidth=1)
# Plot
plt.plot(z, derivative(elu, z), "b-", linewidth=2, label="ELU")
plt.grid(True)
plt.title("Derivative", fontsize=14)
plt.axis([-5, 5, -2.2, 3.2])
plt.show()

- ELU modifies the slope of the negative part of the function.
- Unlike the Leaky ReLU and PReLU functions, which use a straight line, ELU uses an exponential curve for the negative values.
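
The shape of the negative part can be checked directly. The sketch below re-defines `elu` and adds an analytic derivative, `elu_grad` (a name introduced here for illustration): for $z < 0$ the slope is $\alpha e^{z}$, and for $z \ge 0$ it is 1.

```python
import numpy as np

def elu(z, alpha=1.0):
    return np.where(z < 0, alpha * (np.exp(z) - 1), z)

def elu_grad(z, alpha=1.0):
    # alpha * exp(z) for z < 0, and 1 for z >= 0
    return np.where(z < 0, alpha * np.exp(z), 1.0)

z = np.array([-10.0, -1.0, -1e-9, 0.0, 2.0])
outputs = elu(z)
slopes = elu_grad(z)

print(outputs)  # negative outputs level off just above -alpha
print(slopes)   # with alpha = 1 the slope approaches 1 as z approaches 0 from the left
```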

According to the authors, ELU outperformed all the ReLU variants in their experiments.

**Problem with ELU**

The main drawback of the ELU activation is that it is slower to compute than the ReLU and its variants (due to the use of the exponential function), but during training this is compensated by the faster convergence rate. However, at test time, an ELU network will be slower than a ReLU network.



## How to choose an activation function?

We have gone through several different activation functions used in deep learning. When building a model, the selection of activation functions is critical. So which activation function should you use? Here is a general suggestion from the book Hands-on ML:

> Although your mileage will vary, in general SELU > ELU > leaky ReLU (and its variants) > ReLU > tanh > logistic. If the network’s architecture prevents it from self-normalizing, then ELU may perform better than SELU (since SELU is not smooth at z = 0). If you care a lot about runtime latency, then you may prefer leaky ReLU. If you don’t want to tweak yet another hyperparameter, you may just use the default α values used by Keras (e.g., 0.3 for the leaky ReLU). If you have spare time and computing power, you can use cross-validation to evaluate other activation functions, in particular, RReLU if your network is over‐fitting, or PReLU if you have a huge training set.

Hands-on ML, page 332

# Setting up the Notebook

In [None]:
import tensorflow as tf
import numpy as np
import pandas as pd

import pathlib
import shutil
import tempfile

from IPython import display
from matplotlib import pyplot as plt
from sklearn.metrics import roc_auc_score

Set up TensorBoard

In [None]:
logdir = pathlib.Path(tempfile.mkdtemp())/"tensorboard_logs"
shutil.rmtree(logdir, ignore_errors=True)

# Load the TensorBoard notebook extension
%load_ext tensorboard

# Open an embedded TensorBoard viewer
%tensorboard --logdir {logdir}/models

Function to plot the models' training history once training has completed.

In [None]:
from itertools import cycle

def plotter(history_hold, metric='binary_crossentropy', ylim=(0.0, 1.0)):
  """Plot training (solid) and validation (dashed) curves of `metric` for each model."""
  cycol = cycle('bgrcmk')  # one colour per model
  for name, item in history_hold.items():
    y_train = item.history[metric]
    y_val = item.history['val_' + metric]
    x_train = np.arange(0, len(y_val))  # epoch index

    c = next(cycol)

    plt.plot(x_train, y_train, c+'-', label=name+'_train')
    plt.plot(x_train, y_val, c+'--', label=name+'_val')

  plt.legend()
  plt.xlim([1, max(plt.xlim())])
  plt.ylim(ylim)
  plt.xlabel('Epoch')
  plt.ylabel(metric)
  plt.grid(True)

Create callbacks for TensorBoard

In [None]:
m_histories = {}

def get_callbacks(name):
  return [
    tf.keras.callbacks.TensorBoard(logdir/name, histogram_freq=1),
  ]

# Load the dataset 

Let's load the dataset and set it up to be used with deep learning models. Here the CSV file is copied from Google Drive.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
!cp /content/drive/'My Drive'/COSC2779/Lectures/Week02/default_of_credit_card_clients.csv .
!ls

In [None]:
data=pd.read_csv("default_of_credit_card_clients.csv")

In [None]:
data.head()

# Explore Data

## Description of the dataset

This UCI dataset contains information on default payments, demographic factors, credit data, history of payment, and bill statements of credit card clients in Taiwan from April 2005 to September 2005. The dataset is available from the UCI Machine Learning Repository.

The dataset is composed of 24 variables in total. The first variables contain personal information about the client:

- ID: ID of each client, categorical variable
- LIMIT_BAL: Amount of given credit in New Taiwan dollars (includes individual and family/supplementary credit)
- SEX: Gender, categorical variable (1=male, 2=female)
- EDUCATION: level of education, categorical variable (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)
- MARRIAGE: Marital status, categorical variable (1=married, 2=single, 3=others)
- AGE: Age in years, numerical variable 


Other variables contain information about the history of past payments. The following attributes track the past monthly payment records, i.e. the payment delay for a specific month:

- PAY_0: Repayment status in September 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above)
- PAY_2: Repayment status in August 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above)
- PAY_3: Repayment status in July 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above)
- PAY_4: Repayment status in June 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above)
- PAY_5: Repayment status in May 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above)
- PAY_6: Repayment status in April 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above)

The following attributes instead consider the information related to the amount of bill statement, i.e. a monthly report that credit card companies issue to credit card holders in a specific month:

- BILL_AMT1: Amount of bill statement in September, 2005 (New Taiwan dollar)
- BILL_AMT2: Amount of bill statement in August, 2005 (New Taiwan dollar)
- BILL_AMT3: Amount of bill statement in July, 2005 (New Taiwan dollar)
- BILL_AMT4: Amount of bill statement in June, 2005 (New Taiwan dollar)
- BILL_AMT5: Amount of bill statement in May, 2005 (New Taiwan dollar)
- BILL_AMT6: Amount of bill statement in April, 2005 (New Taiwan dollar)

The last variables instead consider the amount of previous payment in a specific month:

- PAY_AMT1: Amount of previous payment in September, 2005 (New Taiwan dollar)
- PAY_AMT2: Amount of previous payment in August, 2005 (New Taiwan dollar)
- PAY_AMT3: Amount of previous payment in July, 2005 (New Taiwan dollar)
- PAY_AMT4: Amount of previous payment in June, 2005 (New Taiwan dollar)
- PAY_AMT5: Amount of previous payment in May, 2005 (New Taiwan dollar)
- PAY_AMT6: Amount of previous payment in April, 2005 (New Taiwan dollar)

The variable to predict is given by:

- default payment next month: indicates whether the credit card holder is a defaulter or non-defaulter (1=yes, 0=no)

## Data Distribution

In [None]:
plt.figure(figsize=(20,20))
for i, col in enumerate(data.columns):
    plt.subplot(5,5,i+1)
    plt.hist(data[col], alpha=0.3, color='b', density=True)
    plt.title(col)
    plt.xticks(rotation='vertical')

## Data cleaning

The data cleaning process is the procedure of correcting or removing incomplete/inaccurate or incorrect portions of the dataset. 

In [None]:
data[['LIMIT_BAL','SEX', 'EDUCATION', 'MARRIAGE', 'AGE']].describe()

In [None]:
summary = data['EDUCATION'].value_counts()
print(summary)

As far as EDUCATION is concerned, there are three categories (0, 5, and 6) that are not listed in the description of the dataset provided by the UCI website.

These rows are deleted.

In [None]:
m = (data['EDUCATION'] == 0)|(data['EDUCATION'] == 6)|(data['EDUCATION'] == 5)
data = data.drop(data.EDUCATION[m].index.values, axis=0)
summary = data['EDUCATION'].value_counts()
print(summary)

For MARRIAGE, the output of `.describe()` shows a minimum value of 0, which does not correspond to any category previously described. These rows are also removed.

In [None]:
summary = data['MARRIAGE'].value_counts()
print(summary)

In [None]:
m = (data['MARRIAGE'] == 0)
data = data.drop(data.MARRIAGE[m].index.values, axis=0)

summary = data['MARRIAGE'].value_counts()
print(summary)

In [None]:
data[['PAY_' + str(n) for n in [0, 2, 3, 4, 5, 6]]].describe()

All of the PAY_* attributes have a minimum value of -2, which is not included in the documented coding, while the maximum value is 8. The attributes therefore probably need re-scaling; the next cell shifts them by 1.

In [None]:
data[['PAY_' + str(n) for n in [0, 2, 3, 4, 5, 6]]] += 1

In [None]:
summary = data['MARRIAGE'].value_counts()
print(summary)

# Preprocessing



## One-hot encoding for categorical variables

Categorical variables such as SEX, MARRIAGE and EDUCATION are turned into one-hot variables in order to remove any ordering, which in this case has no meaning.
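
As a small illustration of what one-hot encoding does, the sketch below applies `pd.get_dummies` to a toy frame with made-up values (the `toy` frame is not part of the dataset). Each category code becomes its own 0/1 column, and exactly one of those columns is set per row.

```python
import pandas as pd

# Toy frame with made-up values; MARRIAGE uses the coding described earlier (1, 2, 3)
toy = pd.DataFrame({"MARRIAGE": [1, 2, 3, 2], "AGE": [24, 35, 41, 29]})

# One indicator column per category; cast to int so the output is 0/1 on any pandas version
dummies = pd.get_dummies(toy["MARRIAGE"], prefix="MARRIAGE").astype(int)

# Replace the original coded column with the indicator columns
encoded = pd.concat([dummies, toy.drop(columns="MARRIAGE")], axis=1)
print(encoded)
```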

In [None]:
data['EDUCATION'] = data['EDUCATION'].astype('category')
data['SEX'] = data['SEX'].astype('category')
data['MARRIAGE'] = data['MARRIAGE'].astype('category')

data=pd.concat([pd.get_dummies(data['EDUCATION'], prefix='EDUCATION'), 
                  pd.get_dummies(data['SEX'], prefix='SEX'), 
                  pd.get_dummies(data['MARRIAGE'], prefix='MARRIAGE'),
                  data],axis=1)
data.drop(['EDUCATION'],axis=1, inplace=True)
data.drop(['SEX'],axis=1, inplace=True)
data.drop(['MARRIAGE'],axis=1, inplace=True)
data.head()

## Min Max Scaling

Input variables may have different units and therefore different scales. For this reason, a MinMaxScaler() is applied in order to scale the features to the range (0, 1). The transformation is given by the following formulas:

\begin{equation} X_{std} = \frac{X - X_{min}}{X_{max} - X_{min}} \end{equation}

\begin{equation} X_{scaled} = X_{std} \cdot (max - min) + min \end{equation}

Where:

$X_{min}$ is the minimum value on the column

$X_{max}$ is the maximum value on the column

$(min, max)$ are the extreme values of the range chosen, in this case $(0, 1)$

This transformation is applied to numerical features only, as the categorical variables have been transformed into one-hot vectors, whose values are already either 0 or 1.
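
The two formulas can be written out by hand in a few lines. This is a minimal sketch with made-up ages, not a replacement for `MinMaxScaler`; `min_max_scale` is a helper name introduced here for illustration.

```python
import numpy as np

def min_max_scale(x, x_min, x_max, feature_range=(0.0, 1.0)):
    """Apply the two min-max formulas with an explicit column minimum and maximum."""
    lo, hi = feature_range
    x_std = (x - x_min) / (x_max - x_min)
    return x_std * (hi - lo) + lo

ages = np.array([21.0, 30.0, 45.0, 79.0])   # made-up values
scaled = min_max_scale(ages, ages.min(), ages.max())
print(scaled)   # smallest value maps to 0, largest to 1
```

Passing the minimum and maximum explicitly makes it possible to compute them on the training split only and reuse them for the validation and test splits, which is a common way to avoid leaking information from held-out data.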

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
data['LIMIT_BAL'] = scaler.fit_transform(data['LIMIT_BAL'].values.reshape(-1, 1))
data['AGE'] = scaler.fit_transform(data['AGE'].values.reshape(-1, 1))


for i in range(1,7):
    scaler = MinMaxScaler()
    data['BILL_AMT' + str(i)] = scaler.fit_transform(data['BILL_AMT' + str(i)].values.reshape(-1, 1))

for i in range(1,7):
    scaler = MinMaxScaler()
    data['PAY_AMT' + str(i)] = scaler.fit_transform(data['PAY_AMT' + str(i)].values.reshape(-1, 1))
    
for i in [0, 2, 3, 4, 5, 6]:
    scaler = MinMaxScaler()
    data['PAY_' + str(i)] = scaler.fit_transform(data['PAY_' + str(i)].values.reshape(-1, 1))

# Hold-out Validation

In hold-out validation we divide the data into 3 subsets:
1. Training: to obtain the parameters (weights) of the hypothesis.
2. Validation: for tuning hyper-parameters and model selection.
3. Test: to evaluate the performance of the developed model. DO NOT use this split to set or tune any element of the model.

For this example, let's divide the data 60/20/20 into training, validation and test sets.
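
The split is done in two steps, so the second fraction is taken from what remains after the first. The arithmetic below uses an illustrative dataset size of 1000 rows (not the real row count) to show why `test_size=0.2` followed by `test_size=0.25` gives 60/20/20.

```python
n_total = 1000                      # illustrative dataset size
n_test = round(n_total * 0.2)       # first split: 20% held out for testing
n_remaining = n_total - n_test      # 80% left for training + validation
n_val = round(n_remaining * 0.25)   # second split: 25% of the remaining 80% = 20% of the total
n_train = n_remaining - n_val

print(n_train, n_val, n_test)       # 600 200 200
```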

In [None]:
from sklearn.model_selection import train_test_split

with pd.option_context('mode.chained_assignment', None):
    train_data_, test_data = train_test_split(data, test_size=0.2, shuffle=True,random_state=0)
    
with pd.option_context('mode.chained_assignment', None):
    train_data, val_data = train_test_split(train_data_, test_size=0.25, shuffle=True,random_state=0)
    
print(train_data.shape[0], val_data.shape[0], test_data.shape[0])

In [None]:
train_X = train_data.drop(['default payment next month'], axis=1).to_numpy()
train_y = train_data[['default payment next month']].to_numpy()

test_X = test_data.drop(['default payment next month',], axis=1).to_numpy()
test_y = test_data[['default payment next month']].to_numpy()

val_X = val_data.drop(['default payment next month',], axis=1).to_numpy()
val_y = val_data[['default payment next month']].to_numpy()

## Managing class imbalance

In machine learning, it is difficult to train an effective model if the class distribution in the training set is imbalanced. The overall accuracy may be high, but when accuracy is computed separately for each class, the percentage of correctly classified data points is often lower for the minority class than for the majority class. To tackle this problem one can adopt two different strategies:

- Oversampling the minority class;
- Undersampling the majority class.

The following code shows that there is a high imbalance towards class 0, which makes up almost 80% of the dataset. This suggests adopting a technique to rebalance the classes.

In [None]:
l0 = len(train_y[train_y == 0])
l1 = len(train_y[train_y == 1])

print(str(np.round(l0/(l0+l1)*100))+ '%', str(np.round(l1/(l0+l1)*100))+ '%')

## SMOTE

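The Synthetic Minority Oversampling Technique (SMOTE) generates synthetic points for the class being rebalanced: each new observation is placed along the line segment joining a minority-class point and one of its nearest minority-class neighbours. The code below uses `KMeansSMOTE`, a variant that clusters the data with k-means before applying SMOTE within selected clusters.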
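
The interpolation step at the heart of SMOTE can be sketched in plain NumPy. This is a simplified illustration on four made-up minority points, not the `imblearn` implementation (and it leaves out the clustering step of KMeansSMOTE); `synthesize` and its parameters are hypothetical names.

```python
import numpy as np

def synthesize(points, n_new, k, rng):
    """Create n_new points, each on a segment between a sample and one of its k nearest neighbours."""
    new_points = []
    for _ in range(n_new):
        base = points[rng.integers(len(points))]
        # Distances from the base point to every minority point; index 0 after sorting is the point itself
        dists = np.linalg.norm(points - base, axis=1)
        neighbours = np.argsort(dists)[1:k + 1]
        neighbour = points[rng.choice(neighbours)]
        gap = rng.uniform(0.0, 1.0)          # position along the segment
        new_points.append(base + gap * (neighbour - base))
    return np.array(new_points)

rng = np.random.default_rng(0)
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # made-up minority samples
synthetic = synthesize(minority, n_new=5, k=2, rng=rng)
print(synthetic)
```

Because every synthetic point lies between two existing minority points, the new samples stay inside the region already occupied by the minority class.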

In [None]:
from imblearn.over_sampling import SMOTE , KMeansSMOTE

def oversample_dataset(X_train, y_train):

    s = f"<br>Number of instances in the training set before the rebalancing operation: {len(X_train)}"
    #oversample = SMOTE()
    oversample = KMeansSMOTE(cluster_balance_threshold=0.00001)
    X_train_smote, y_train_smote = oversample.fit_resample(X_train, y_train)

    s += f"<br>Number of instances in the training set after the rebalancing operation: {len(X_train_smote)}"
    
    l0 = len(y_train_smote[y_train_smote == 0])
    l1 = len(y_train_smote[y_train_smote == 1])
    
    s += f"<br>There are {l0} rows labelled with 0 ({round(l0/(l1+l0)*100)}%), {l1} rows labelled with 1 ({round(l1/(l1+l0)*100)}%)"
    return X_train_smote, y_train_smote, s

In [None]:
X_train_balanced, y_train_balanced, s_balanced = oversample_dataset(X_train = train_X, y_train = train_y)

In [None]:
l0 = len(y_train_balanced[y_train_balanced == 0])
l1 = len(y_train_balanced[y_train_balanced == 1])


print(str(np.round(l0/(l0+l1)*100))+ '%', str(np.round(l1/(l0+l1)*100))+ '%')

# Data Modelling

## Performance measures

The metrics adopted to evaluate the performance of a classifier are the following:

- Accuracy score: the ratio of correct predictions (TP$^1$ + TN$^2$) over the total number of data points classified (TP + FN$^4$ + FP$^3$ + TN).
- Precision, or positive predictive value: the number of TP divided by the total number of elements predicted as positive (TP + FP); it highlights how valid the results are.
- Recall: the number of TP divided by the total number of elements that actually belong to the positive class (TP + FN); it shows how complete the predictions are.
- F-measure: the harmonic mean of precision and recall, given by the following expression:
\begin{equation} F = 2 \cdot \frac{precision \cdot recall}{precision + recall} \end{equation}

As far as accuracy is concerned, in case of class imbalance this metric may return a high score even if the minority class is not correctly classified. In our case the minority class is the positive one, so precision, recall and F-measure are better suited to measuring the quality of the classifier [10]. In this analysis we focus on detecting which customers are likely to default, so the positive class is the one of interest.

The Sklearn function classification_report() returns the results, in terms of accuracy, precision, recall, and f-measure for all the classes considered.

$^1$ TP, true positive = the number of items correctly labeled as belonging to the positive class

$^2$ TN, true negative = the number of items correctly labeled as belonging to the negative class

$^3$ FP, false positive = the number of items wrongly labeled as belonging to the positive class

$^4$ FN, false negative = the number of items wrongly labeled as belonging to the negative class
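
The definitions above can be worked through on a tiny example. The labels and predictions below are made up (3 positives out of 10, mimicking an imbalanced problem); the sketch counts TP, TN, FP and FN by hand and derives the four metrics from them.

```python
import numpy as np

# Made-up labels and predictions for 10 samples
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 1])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))
tn = int(np.sum((y_true == 0) & (y_pred == 0)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f_measure = 2 * precision * recall / (precision + recall)

print("TP, TN, FP, FN:", tp, tn, fp, fn)
print("accuracy:", accuracy, "precision:", precision, "recall:", recall, "F:", f_measure)
```

Accuracy looks reasonable here even though only one of the three positive cases is found, which is the situation the paragraph on class imbalance warns about.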



In [None]:
model1 = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(29, )),
  tf.keras.layers.Dense(128, activation='sigmoid'),
  tf.keras.layers.Dense(1, activation='sigmoid')
])


model1.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy', 'Precision', 
                       'Recall'])

In [None]:
m_histories['simple1H'] = model1.fit(X_train_balanced, y_train_balanced, 
                                    validation_data=(val_X, val_y), epochs=10, 
                                    verbose=0, 
                                    callbacks=get_callbacks('models/simple1H'))

In [None]:
plotter(m_histories, ylim=[0.0, 1.1], metric = 'accuracy')

In [None]:
test_loss, test_acc, test_Precision, test_Recall = model1.evaluate(test_X, test_y)
print('Test accuracy:', test_acc)
print('Test Precision:', test_Precision)
print('Test Recall:', test_Recall)



In [None]:
from sklearn.metrics import classification_report
test_y_pred = model1.predict(test_X)
print(classification_report(test_y, test_y_pred>.5))

**What should we do next?**

In [None]:
model2 = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(29, )),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.BatchNormalization(),
  tf.keras.layers.Dropout(rate=0.5),
  tf.keras.layers.Dense(64, activation='relu'),
  tf.keras.layers.BatchNormalization(),
  tf.keras.layers.Dropout(rate=0.5),
  tf.keras.layers.Dense(32, activation='relu'),
  tf.keras.layers.BatchNormalization(),
  tf.keras.layers.Dropout(rate=0.5),
  tf.keras.layers.Dense(1, activation='sigmoid')
])


model2.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy', 'Precision', 
                       'Recall'])

m_histories['complex3H'] = model2.fit(X_train_balanced, y_train_balanced, 
                                    validation_data=(val_X, val_y), epochs=10, 
                                    verbose=0, 
                                    callbacks=get_callbacks('models/complex3H'))

plotter(m_histories, ylim=[0.0, 1.1], metric = 'accuracy')

In [None]:
test_y_pred = model2.predict(test_X)
print(classification_report(test_y, test_y_pred>.5))

# GPU vs CPU

Make sure you are using a GPU notebook instance. 
> Edit > notebook settings

In [None]:
import numpy as np
import timeit

A = np.random.rand(5000, 5000).astype(np.float32)
B = np.random.rand(5000, 5000).astype(np.float32)

timer = timeit.Timer("numpy.dot(A, B)",
"import numpy; from __main__ import A, B")
numpy_times_list = timer.repeat(10, 1)

print('Numpy mean time (s): ', np.mean(numpy_times_list))

In [None]:
import tensorflow as tf

A = tf.convert_to_tensor(A)
B = tf.convert_to_tensor(B)

timer = timeit.Timer("tensorflow.matmul(A, B)",
setup="import tensorflow; from __main__ import A, B")
tensorflow_times_list = timer.repeat(10, 1)

print('TensorFlow mean time (s): ', np.mean(tensorflow_times_list))