# Artificial Neural Networks (ANNs)
ANNs are the foundational neural networks of deep learning. In this lecture we will build a simple dense ANN using Keras, but first let's learn the essential mechanisms of an artificial neural network.

### Non-Neural Network Learning
Without a neural network, you hand-write a series of if statements that check whether certain conditions are true, and those checks determine the predicted value.

For instance, to classify an animal as a dog or a cat, we would use if statements to check whether the animal's ears are pointy or lopped, whether its snout is small or large, etc.
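
As a rough sketch of this rule-based style, the function below classifies an animal from two features. The function name, the feature names (`ear_shape`, `snout_size`), and the rules themselves are invented for illustration and are not taken from any real data set.

```python
def classify_animal(ear_shape, snout_size):
    """Hand-coded rules: every condition is chosen by the programmer, not learned."""
    if ear_shape == "pointy" and snout_size == "small":
        return "cat"
    if ear_shape == "lopped" or snout_size == "large":
        return "dog"
    # the rules above do not cover every combination
    return "unknown"


print(classify_animal("pointy", "small"))
print(classify_animal("lopped", "large"))
```

Every new characteristic needs another hand-written rule, which is the limitation a neural network is meant to remove.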

### Neural Network Learning
When using a neural network, you program the network's architecture, and the network determines the relevant characteristics itself.

For instance, the neural network would learn the difference between a dog and a cat and use what it learned to predict whether an animal is a dog or a cat.

# Neurons and Layers
A neuron is a Node that transmits data through layers.  
A layer is a collection of Nodes together at the same depth.

<img src="images/ann/neuron.png" height="65%" width="65%"></img>
- There are 3 layers: input layer, hidden layer (middle), and output layer.

The input value(s) are standardized (and sometimes normalized).  
The output value(s) can be continuous (number), binary (yes or no), or categorical (dummy variables).

The input value(s) are transmitted through neuron(s), which are processed throughout the layers to determine output value(s).

In real life, a human's input values are the five senses: sight, touch, sound, taste, and smell. These input values are transmitted to neurons, which process them and determine an output. For example, if I touch a fire with my hand, my sense of touch signals a neuron, which processes the signal and determines that I need to pull my hand away from the fire.

### Weights
For each synapse (signal), there is a weight that measures the significance of the signal. Weights are crucial: they are the values that get adjusted as the neural network learns. This is where gradient descent and backpropagation come into play, but we'll get to that later.

<img src="images/ann/weights_neuron.png" height="75%" width="75%"></img>
- W1, W2, and Wm are the individual weights for each synapse (arrow, or signal)

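The formula inside the neuron is the weighted sum of its inputs: each input value is multiplied by the weight of its synapse, and the results are added together. An activation function is then applied to this sum to determine the signal passed on to the next layer.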
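
A minimal NumPy sketch of the weighted sum inside one neuron is shown here. The input values and weights are made-up numbers, and no bias term is included; both are assumptions for illustration only.

```python
import numpy as np

# hypothetical, already-scaled input values x1..x3 and their synapse weights w1..w3
inputs = np.array([0.5, -1.2, 0.3])
weights = np.array([0.8, 0.1, -0.4])

# the neuron sums each input multiplied by its weight: w1*x1 + w2*x2 + w3*x3
weighted_sum = np.dot(weights, inputs)

print(weighted_sum)
```

A weight with a larger magnitude makes its input count for more in the sum, which is what "significance of a signal" means here.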

# Activation Function
The activation function transforms the weighted sum of the input values into the neuron's output. There are many types of activation functions, and some work better than others depending on the neural network.

The input values should be feature scaled (standardized or normalized) for these functions to work properly.

Below are different types of activation functions.
- The x-value is the weighted sum of the input value(s)
- The y-value is the neuron's contribution to the output value(s)

### 1. Threshold Function
A "binary" (yes or no) activation function.

<img src="images/ann/threshold_function.png" height="50%" width="50%"></img>

### 2. Sigmoid Function
A smooth activation function with gradual progression, very useful for the output layer to predict the probability of success.

<img src="images/ann/sigmoid_function.png" height="50%" width="50%"></img>

### 3. Rectifier Function
One of the most popular functions: the output is 0 for x-values below 0 and increases linearly with x after that.

<img src="images/ann/rectifier_function.png" height="50%" width="50%"></img>

### 4. Hyperbolic Tangent (tanh)
Similar to the sigmoid function, but the output ranges from -1 to 1, so it can be negative.

<img src="images/ann/tanh_function.png" height="50%" width="50%"></img>
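
The four activation functions above can be sketched in plain Python as follows. The function names are chosen for illustration, and treating a weighted sum of exactly 0 as "yes" in the threshold function is an assumed convention.

```python
import math


def threshold(x):
    """Return 1 once the weighted sum reaches 0, otherwise 0 (assumed convention at 0)."""
    return 1.0 if x >= 0 else 0.0


def sigmoid(x):
    """Squash the weighted sum into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def rectifier(x):
    """ReLU: 0 for negative weighted sums, x itself otherwise."""
    return max(0.0, x)


def tanh(x):
    """Squash the weighted sum into the range (-1, 1)."""
    return math.tanh(x)


# compare the four functions at a negative, zero, and positive weighted sum
for x in (-2.0, 0.0, 2.0):
    print(x, threshold(x), sigmoid(x), rectifier(x), tanh(x))
```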

# How Do Neural Networks Work?
Let's learn how neural networks actually work.

### Shallow Neural Network
A machine learning algorithm without deep learning can be modelled as below.

<img src="images/ann/basic_neural_network.png" height="50%" width="50%"></img>

This neural network is very basic: there are only independent variables (input layer), parameter tuning variables (weights), and a dependent variable (output layer). This is actually how most machine learning models work if there is no deep learning involved.

Fortunately, in deep learning, there are "hidden" layers that can increase the accuracy of the model. A "hidden" layer is any layer in between the input and output layers of the neural network.

### Deep Neural Network
A deep neural network has "hidden" layers that process the input values further. Let's assume a neural network has already been trained, so let's observe how it will work.

The neural network below is trying to predict the price of a house based on area, bedrooms, distance to city, and age.

<img src="images/ann/neural_network_house_price.png" height="50%" width="50%"></img>

Each neuron in the hidden layer effectively uses only some of the input values, because the weights on its synapses (signals) determine whether or not a signal is significant enough for the neuron.

For example, the middle neuron in the hidden layer focuses only on the "Area", "Bedrooms", and "Age" input values. Maybe the already trained neuron determined that younger people prefer a large area and lots of bedrooms, so only those input values are significant enough to impact the middle neuron.

Another example is the last neuron in the hidden layer, which focuses only on "Age". Maybe the neuron determined that a house over 100 years old is priced significantly higher for historical reasons. This is a good example of when to use the rectifier activation function: if the age is over 100, the neuron's contribution to the output increases, and if not, the neuron's contribution to the output is 0.

Together, all the neurons can be used to predict the price of a house as seen in the output layer.

# Propagations
There are two types of propagation: forward and back propagation. Both are necessary for the neural network to learn the trends of the data set.

Let's say we're trying to determine a person's exam score based on his or her study hours, sleep hours, and quiz score. 

### Forward Propagation
<img src="images/ann/forward_propagation.png" height="50%" width="50%"></img>
- The output (predicted) exam score is noted as y^ and the actual exam score is noted as y

In forward propagation, the neural network predicts an output (predicted) value. The neural network then uses a Cost function (the Mean-Squared Error function) to compare the predicted value to the actual value.
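
Forward propagation for the exam score example might look like the sketch below. The network size (3 inputs, 2 hidden neurons, 1 output), all weights, and the scaled input values are made up for illustration. The cost for this single sample is written as half the squared error, a common convention assumed here.

```python
import numpy as np

# one hypothetical student: study hours, sleep hours, quiz score (already scaled)
x = np.array([0.6, 0.4, 0.9])
y = 0.85  # actual exam score, scaled to [0, 1]

# made-up weights: 3 inputs -> 2 hidden neurons -> 1 output
w_hidden = np.array([[0.2, -0.5],
                     [0.4, 0.1],
                     [0.7, 0.3]])
w_output = np.array([0.6, 0.9])

# forward propagation: weighted sum, then activation, layer by layer
hidden = np.maximum(0.0, x @ w_hidden)  # rectifier activation
y_hat = hidden @ w_output               # predicted exam score

# cost: half the squared error between the predicted and actual value
cost = 0.5 * (y_hat - y) ** 2

print(y_hat, cost)
```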

### Back Propagation
<img src="images/ann/back_propagation.png" height="50%" width="50%"></img>

Then, using the cost function, the network performs back propagation: the error is sent back through the network to update the weights of the synapses.

### Epoch
One epoch is when the entire data set has been forward and back propagated once. The goal is to minimize the cost function, so we must perform multiple epochs to better learn the trends of the data set.

However, too many epochs may cause overfitting. Overfitting means that the model memorizes the training data instead of learning trends that generalize. To avoid overfitting, stop training early once the validation accuracy flattens out or starts decreasing.
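
One way to sketch the early stopping rule is the plain-Python function below. It monitors validation loss rather than validation accuracy (the same idea with the comparison flipped). The function name, the `patience` parameter, and the loss values are all hypothetical.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch at which training would stop.

    Training stops once the validation loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    # never triggered: train for every epoch
    return len(val_losses) - 1


# made-up validation losses: improving at first, then getting worse
losses = [0.60, 0.45, 0.40, 0.41, 0.43, 0.50]
print(early_stopping_epoch(losses))
```

A larger `patience` tolerates more noisy epochs before stopping, at the price of training a little longer than necessary.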

# Gradient Descent
Now that we understand that back propagation sends a signal back to the neurons to update the synapse weights, we also need to understand how the weights are actually adjusted.

The goal is to minimize the cost function, so which weight values could accomplish that?

### Brute Force Approach
If we brute-forced the weights by guessing every combination, it would be far too inefficient: the number of combinations grows exponentially with the number of weights (the curse of dimensionality).

### Gradient Descent Approach
Graph the cost function, where C is the dependent variable, y^ is the independent variable, and the vertex lies at y^ = y.

The goal is to reach the minimal cost, which is the vertex of the graph (where the prediction equals the actual value). To get closer to the minimal cost, compute the derivative (slope) of the cost at the current point to determine the direction to move (right if the slope is negative, left if it is positive) in order to descend the cost.

<img src="images/ann/gradient_descent_1.png" height="30%" width="30%"></img>

Roll the cost ball to the right because it's a negative derivative.
<hr>

<img src="images/ann/gradient_descent_2.png" height="30%" width="30%"></img>

The cost ball rolled to the right. Now we need to roll the cost ball to the left because it's a positive derivative.
<hr>

<img src="images/ann/gradient_descent_3.png" height="30%" width="30%"></img>

The cost ball rolled to the left. Now we need to roll the cost ball to the right because it's a negative derivative.
<hr>

<img src="images/ann/gradient_descent_4.png" height="30%" width="30%"></img>

The cost ball is now at the minimal cost. We determined the best weights for the neural network!
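
The rolling ball can be sketched in a few lines of Python. This assumes the quadratic cost C = 0.5 * (y^ - y)^2 and a hypothetical learning rate (step size), and for simplicity it adjusts y^ directly, whereas a real network adjusts the weights that produce y^. The learning rate is deliberately large so that the ball overshoots the vertex and alternates sides as it closes in.

```python
def cost(y_hat, y):
    """Quadratic cost with its minimum (vertex) at y_hat == y."""
    return 0.5 * (y_hat - y) ** 2


def cost_derivative(y_hat, y):
    """Slope of the cost curve at the current prediction."""
    return y_hat - y


y = 3.0              # actual value (the vertex of the parabola)
y_hat = 0.0          # arbitrary starting prediction
learning_rate = 1.5  # hypothetical step size

for step in range(20):
    slope = cost_derivative(y_hat, y)
    # negative slope -> move right, positive slope -> move left
    y_hat -= learning_rate * slope

print(y_hat, cost(y_hat, y))
```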

### Partial Derivatives of Gradient Descent
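In a real neural network, the cost depends on many weights (and possibly multiple outputs) rather than on a single variable. How would gradient descent work then? To handle this, the gradient descent algorithm calculates the partial derivative of the cost with respect to each individual weight and adjusts every weight in the direction that lowers the cost.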
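
As a small worked case, assume a single neuron with a linear output and the quadratic cost. The symbols $x_i$ (input values), $w_i$ (weights), and $\eta$ (learning rate) are notation introduced here for illustration:

$$
\hat{y} = \sum_{i=1}^{m} w_i x_i, \qquad C = \tfrac{1}{2}\,(\hat{y} - y)^2
$$

$$
\frac{\partial C}{\partial w_i} = (\hat{y} - y)\,x_i, \qquad w_i \leftarrow w_i - \eta\,\frac{\partial C}{\partial w_i}
$$

Each weight gets its own slope, so each weight is moved by its own amount in every update.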

# Stochastic Gradient Descent
In a convex function like the quadratic Cost function, there is a single minimum. However, if we used a different Cost function that had multiple local minima, then the Gradient Descent algorithm might not descend to the global minimum but instead to a local minimum.

<img src="images/ann/gradient_descent_problem.png" height="30%" width="30%"></img>

Notice how the cost ball is at a local minimum, but not the best global minimum. Therefore, we solve this problem by using the Stochastic Gradient Descent!

### Comparing The Two Gradient Descents
<img src="images/ann/batch_vs_stochastic_gradient_descent.png" height="75%" width="75%"></img>

For the Standard Gradient Descent, the algorithm calculates the Cost function by summing the whole "batch" (all the rows) of the data set, then it applies the Gradient Descent on the weights. This is an example of Batch Learning!

For the Stochastic Gradient Descent, the algorithm calculates the Cost function for each row, then it applies Gradient Descent on the weights. Basically, it updates the weights one row at a time. This is an example of Online Learning!

### Advantages of Stochastic Gradient Descent
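1. It is more likely to escape a local minimum and reach the global minimum, because its noisier updates can bounce the cost out of shallow local minima.  
2. Each update is cheaper because the algorithm does not have to load the whole data set into memory, so it is lighter and often converges in less time.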

### Disadvantages of Stochastic Gradient Descent
1. The rows may be picked at random, so the neural network is updated in a stochastic (random) manner. Therefore, the number of epochs needed to minimize the Cost function also varies from run to run.

### Mini-Batch Gradient Descent
Mini-Batch Gradient Descent combines the ideas of Standard and Stochastic Gradient Descent.

A batch is the set of rows (samples) used in one forward and back propagation. In Stochastic Gradient Descent the batch size is 1, and in Standard Gradient Descent the batch size is the entire data set.

Mini-Batch Gradient Descent specifies the batch size (number of samples) parameter to perform a forward and back propagation, thus combining the ideas of the two algorithms. We will use Mini-Batch Gradient Descent for this lecture.
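
The three variants differ only in the batch size, which the sketch below illustrates on a linear model. The function `gradient_descent`, its default parameters, and the synthetic data are assumptions made for this example and are not part of the lecture's Keras model.

```python
import numpy as np


def gradient_descent(x, y, batch_size, epochs=200, learning_rate=0.1, seed=0):
    """Fit a linear model y_hat = x @ w, updating w once per batch."""
    rng = np.random.default_rng(seed)
    w = np.zeros(x.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(x))  # shuffle the rows each epoch
        for start in range(0, len(x), batch_size):
            rows = order[start:start + batch_size]
            error = x[rows] @ w - y[rows]             # forward propagation
            gradient = x[rows].T @ error / len(rows)  # slope of the cost
            w -= learning_rate * gradient             # weight update
    return w


# synthetic data generated from the known weights [2, -1]
rng = np.random.default_rng(42)
x = rng.normal(size=(100, 2))
y = x @ np.array([2.0, -1.0])

print(gradient_descent(x, y, batch_size=1))       # stochastic
print(gradient_descent(x, y, batch_size=10))      # mini-batch
print(gradient_descent(x, y, batch_size=len(x)))  # standard (batch)
```

With a batch size of 1 the weights are updated 100 times per epoch, with a batch size of 10 they are updated 10 times, and with the full data set only once.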

# Training Algorithm
Let's put all the mechanisms together to develop a training algorithm for the artificial neural network.

1. Randomly initialize weights to small numbers close to 0.  
2. Input the independent variables into the input layer.  
3. Perform forward propagation, then measure the Cost function.  
4. Perform back propagation, then use the Cost function to update the weights with online learning (stochastic), batch learning, or mini-batch learning.
5. Repeat for more epochs, and stop early once the validation accuracy flattens out to prevent overfitting.

In [97]:
# import libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [98]:
# import the data set, classify if a customer will exit (1) or will not exit (0)
churn_df = pd.read_csv("datasets/ann/churn_modelling.csv")

churn_df.head()

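| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |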


In [99]:
# x is the columns at indexes 3 through 12 (CreditScore to EstimatedSalary)
x = churn_df.iloc[:, 3:13].values

# y is the Exited column
y = churn_df.iloc[:, 13].values

In [100]:
# import the encoders and column transformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

# label encode the Gender column, no need to one hot encode because it has only 2 categories
label_encoder_gender = LabelEncoder()
x[:, 2] = label_encoder_gender.fit_transform(x[:, 2])

# one hot encode the Geography column
geography_column_index = 1
transformer = ColumnTransformer(
    [("one_hot_encoder", OneHotEncoder(categories="auto"), [geography_column_index])],
    remainder="passthrough"
)
x = transformer.fit_transform(x)

# avoid the dummy variable trap from the encoded Geography column
x = x[:, 1:]

In [101]:
# split the data set into training and testing data sets
from sklearn.model_selection import train_test_split 
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=0)

In [102]:
# import a Standardization Scaler for Feature Scaling
from sklearn.preprocessing import StandardScaler

# feature scale the training and testing sets
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)
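
To make the fit/transform split concrete, here is a hand-rolled NumPy sketch of the same standardization on a tiny made-up data set. The arrays and variable names are hypothetical. The point is that the mean and standard deviation come from the training set only and are then reused on the testing set.

```python
import numpy as np

# tiny made-up training and testing sets with one feature column
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[40.0]])

# "fit" learns the mean and standard deviation from the training set only
mean = train.mean(axis=0)
std = train.std(axis=0)

# "transform" applies the same training statistics to both sets
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.ravel(), test_scaled.ravel())
```

Reusing the training statistics keeps information about the testing set from leaking into the model.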



# Artificial Neural Network Model
Now that we pre-processed the data set, let's begin programming the ANN!

In [103]:
# import keras, a high-level API to create neural networks using a tensorflow backend
import keras

# import the sequential model, a standard model to create a neural network
from keras.models import Sequential

# import the dense layer, a regularly connected neural network layer
from keras.layers import Dense

In [104]:
# initialize the neural network as a sequence of layers
classifier = Sequential()

In [105]:
"""
add the input layer and the first hidden layer
- units = 6 means that there are 6 neurons in this hidden layer;
    this was determined by a general tip where you get the average of the number
    of neurons in the input and output layer, which is (11 + 1) / 2 = 6
- kernel_initializer = uniform initializes the weights to small numbers
    close to 0 in a uniform distribution
- activation = relu stands for the rectifier activation function
- input_dim = 11 means that there are 11 neurons (columns or indep. variables) in the input layer
"""
classifier.add(
    Dense(
        units=6,
        kernel_initializer="uniform",
        activation="relu",
        input_dim=11
    )
)

In [106]:
# add the second hidden layer using the same parameters as the first hidden layer
classifier.add(
    Dense(
        units=6,
        kernel_initializer="uniform",
        activation="relu"
    )
)

In [107]:
"""
add the output layer using a sigmoid activation function for a single output neuron

If you wish to have multiple output neurons (more than two classes), then change the units parameter
to the number of output neurons and change the activation function to "softmax", because
the "sigmoid" activation function is best suited to a single binary output.
"""
classifier.add(
    Dense(
        units=1,
        kernel_initializer="uniform",
        activation="sigmoid"
    )
)

In [111]:
"""
compile the ANN
- optimizer = adam is an adaptive variant of the Stochastic Gradient Descent algorithm
- loss = binary_crossentropy because the output layer predicts a binary outcome with a sigmoid activation
- metrics = accuracy means to use the accuracy metric to determine how accurate the model is
"""
classifier.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

In [115]:
"""
fit the ANN using 10 epochs and update the weights after every 10 rows (batch)

As the accuracy increases, the loss generally decreases because the predictions contain fewer errors.
It's a good idea to early stop training the neural network once the validation loss begins to increase
"""
classifier.fit(x_train, y_train, batch_size=10, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fb518406390>

In [116]:
# predict the testing set
y_pred = classifier.predict(x_test)
y_pred = y_pred > 0.5

# Confusion Matrix

In [117]:
# import the confusion matrix function
from sklearn.metrics import confusion_matrix

In [119]:
# create a confusion matrix that compares the y_test (actual) to the y_pred (prediction)
cm = confusion_matrix(y_test, y_pred)

"""
Read the Confusion Matrix diagonally:
1489 + 219 = 1708 correct predictions
106 + 186 = 292 incorrect predictions
"""
cm

array([[1489,  106],
       [ 186,  219]])
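
The usual summary metrics can be derived from the four counts printed above. This sketch copies the matrix by hand and assumes the scikit-learn layout, where rows are actual classes and columns are predicted classes. The variable names are chosen for illustration.

```python
# confusion matrix from the output above: rows are actual, columns are predicted
cm = [[1489, 106],
      [186, 219]]

true_negatives, false_positives = cm[0]
false_negatives, true_positives = cm[1]
total = true_negatives + false_positives + false_negatives + true_positives

# share of all predictions that were correct
accuracy = (true_positives + true_negatives) / total
# share of predicted exits that really exited
precision = true_positives / (true_positives + false_positives)
# share of real exits that the model caught
recall = true_positives / (true_positives + false_negatives)

print(f"accuracy:  {accuracy:.3f}")
print(f"precision: {precision:.3f}")
print(f"recall:    {recall:.3f}")
```

The accuracy works out to 1708 / 2000 = 0.854, while the lower recall shows that the model misses a fair share of the customers who actually exit.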