# Artificial Neural Networks (ANNs)
The artificial neural network (ANN) is the foundational neural network of deep learning. We will create a simple ANN using Keras in this lecture, but first let's learn about the essential mechanisms of an artificial neural network.

### Non-Neural Network Learning
Without a neural network, you write the rules yourself: a series of if statements checks whether certain conditions are true, and those checks determine the predicted value.

For instance, to classify an animal as a dog or a cat, we would use if statements to check whether the animal's ears are pointy or lopped, whether its snout is small or large, etc.

### Neural Network Learning
When using a neural network, you program the network's architecture, and the network then learns the relevant characteristics itself from the training data.

For instance, the neural network would learn the difference between a dog and a cat from examples, then use what it learned to predict whether an animal is a dog or a cat.

# Neuron (Node)
A neuron (node) receives input signals, processes them, and transmits an output signal to the next layer.

<img src="images/ann/neuron.png" height="65%" width="65%"></img>
- The input values are standardized (and sometimes normalized)
- The output value can be continuous (number), binary (yes or no), or categorical (dummy variables)

The input values are transmitted to one or more neurons, which process them to determine an output value.

In real life, a human's input values are the 5 senses: sight, touch, sound, taste, and smell. These input values are transmitted to the neurons, which process them and determine an output.

For example, if I touch a fire with my hand, then my touch input value will signal the neuron, and then the neuron will process it and determine that I need to take my hand away from the fire.

### Weights
Each synapse (signal) has a weight that measures the significance of that signal. Weights are crucial because they are the values that get adjusted as the neural network learns. This is where gradient descent and backpropagation come into play, but we'll get to that later.

<img src="images/ann/weights_neuron.png" height="75%" width="75%"></img>
- W1, W2, and Wm are the individual weights for each synapse (arrow, or signal)

Inside the neuron, the input values are multiplied by their weights and summed. An activation function is then applied to this weighted sum, and the result determines whether, and how strongly, the signal is passed on to the neurons in the next layer.
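
To make the weighted sum concrete, here is a minimal sketch in plain Python. The input values, the weights, and the helper name `weighted_sum` are made up for illustration, and the optional `bias` term is an assumption that the text above does not cover.

```python
def weighted_sum(inputs, weights, bias=0.0):
    """Multiply each input by its weight, add everything up, then add the bias."""
    return sum(x * w for x, w in zip(inputs, weights)) + bias

# Hypothetical, already-scaled input values and one weight per synapse.
inputs = [0.5, -1.0, 2.0]
weights = [0.4, 0.3, 0.1]

z = weighted_sum(inputs, weights)
print(z)  # approximately 0.1; this is the value the activation function receives
```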

# Activation Function
A function applied to the weighted sum of the input values (plus a bias term, if one is used) to produce the neuron's output. There are many types of activation functions, and some work better than others depending on the neural network.

Note that the input values should be feature scaled (standardized or normalized) for these functions to work well.

Below are different types of activation functions.
- The x-value is the weighted sum of the input value(s)
- The y-value is the neuron's contribution to the output value(s)

### 1. Threshold Function
A "binary" (yes or no) activation function: it outputs 1 once the weighted sum reaches a threshold (typically 0) and 0 otherwise.

<img src="images/ann/threshold_function.png" height="50%" width="50%"></img>

### 2. Sigmoid Function
A smooth activation function that rises gradually from 0 to 1, which makes it very useful in the output layer for predicting a probability.

<img src="images/ann/sigmoid_function.png" height="50%" width="50%"></img>

### 3. Rectifier Function
One of the most popular functions: it outputs 0 for negative x-values and increases linearly for x-values above 0.

<img src="images/ann/rectifier_function.png" height="50%" width="50%"></img>

### 4. Hyperbolic Tangent (tanh)
Similar to the sigmoid function, but its output ranges from -1 to 1, so the value can be negative.

<img src="images/ann/tanh_function.png" height="50%" width="50%"></img>
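
The four activation functions above can be sketched in a few lines of plain Python. This is only an illustration: the function names are our own, and placing the threshold at exactly 0 (with 0 itself counting as "yes") is an assumed convention.

```python
import math

def threshold(x):
    """Binary step: 1 if the weighted sum is at least 0, otherwise 0."""
    return 1.0 if x >= 0 else 0.0

def sigmoid(x):
    """Smooth curve between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def rectifier(x):
    """0 for negative values, the value itself otherwise."""
    return max(0.0, x)

def tanh(x):
    """Smooth curve between -1 and 1."""
    return math.tanh(x)

# Compare the four functions on a few weighted sums.
for x in (-2.0, 0.0, 2.0):
    print(x, threshold(x), sigmoid(x), rectifier(x), tanh(x))
```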

# How Do Neural Networks Work?
Let's learn how neural networks actually work.

### Shallow Neural Network
A machine learning algorithm without deep learning can be modelled as shown below.

<img src="images/ann/basic_neural_network.png" height="50%" width="50%"></img>

This neural network is very basic: there are only independent variables (input layer), parameter tuning variables (weights), and a dependent variable (output layer). Many machine learning models without deep learning, such as linear and logistic regression, can be viewed this way.

In deep learning, "hidden" layers are added between the input and output layers, which lets the model learn more complex patterns and often increases its accuracy.

### Deep Neural Network
A deep neural network has "hidden" layers that process the input values further. Let's assume the neural network has already been trained and observe how it works.

The neural network below is trying to predict the price of a house based on area, bedrooms, distance to city, and age.

<img src="images/ann/neural_network_house_price.png" height="50%" width="50%"></img>

Each neuron in the hidden layer effectively accepts only some of the input values: the weights on its synapses (signals) determine whether a signal is significant enough for that neuron, and a weight close to 0 means the input is ignored.

For example, the middle neuron in the hidden layer focuses on only the "Area", "Bedrooms", and "Age" input values. Maybe the trained neuron determined that younger people prefer a large area and lots of bedrooms, so it only accepts the signals from those input values to determine whether the criteria are met.

Another example is the last neuron in the hidden layer, which focuses only on "Age". Maybe the neuron determined that a house more than 100 years old is priced significantly higher for historical reasons. This is a good example of when to use the rectifier activation function: if the age is above 100 years, the neuron's contribution to the output increases, and otherwise its contribution is 0.

Together, all the neurons can be used to predict the price of a house as seen in the output layer.
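
Here is a minimal sketch of how a trained network like this could compute its prediction. Everything numeric is made up for illustration: the scaled input values, the weights, and the choice of which inputs the first neuron looks at. The helper `dense` is hypothetical (it is not a Keras function), and biases are left out to keep the sketch short.

```python
def rectifier(x):
    return max(0.0, x)

def dense(inputs, weight_rows, activation):
    """One layer: each row of weights feeds one neuron."""
    return [activation(sum(x * w for x, w in zip(inputs, row)))
            for row in weight_rows]

# Hypothetical scaled inputs: area, bedrooms, distance to city, age.
house = [0.8, 0.6, 0.3, 0.9]

# Hypothetical trained weights; a weight of 0 means the neuron ignores that input.
hidden_weights = [
    [0.5, 0.0, 0.7, 0.0],  # assumed: looks at area and distance to city
    [0.4, 0.6, 0.0, 0.3],  # looks at area, bedrooms, and age
    [0.0, 0.0, 0.0, 1.0],  # looks at age only
]
output_weights = [[0.3, 0.5, 0.2]]

hidden = dense(house, hidden_weights, rectifier)
price = dense(hidden, output_weights, lambda x: x)[0]  # linear output neuron
print(hidden, price)
```

The printed price is on the same made-up scale as the inputs; a real model would map it back to a currency value.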

# Propagations
There are two types of propagation: forward and back propagation. Both are necessary for the neural network to learn the trends of the data set.

Let's say we're trying to determine a person's exam score based on his or her study hours, sleep hours, and quiz score.

### Forward Propagation
<img src="images/ann/forward_propagation.png" height="50%" width="50%"></img>
- The output (predicted) exam score is noted as y^ and the actual exam score is noted as y

In forward propagation, the input values pass through the network from the input layer to the output layer to produce an output (predicted) value. The neural network then uses a cost function (C) to measure the error between the output and the actual value.
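
As a small illustration, here is one common choice of cost function, the squared error C = 1/2 (y^ - y)^2. We are assuming this form because the lecture later refers to a quadratic cost function; the exam scores are made up.

```python
def cost(y_hat, y):
    """Squared-error cost for one prediction (an assumed choice of C)."""
    return 0.5 * (y_hat - y) ** 2

y_hat = 82.0  # hypothetical predicted exam score
y = 90.0      # hypothetical actual exam score
print(cost(y_hat, y))  # 32.0
```

The further the prediction is from the actual value, the larger the cost, and a perfect prediction gives a cost of 0.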

### Back Propagation
<img src="images/ann/back_propagation.png" height="50%" width="50%"></img>

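Then, in back propagation, the error measured by the cost function is propagated backward through the network to work out how much each weight contributed to it, and the weights of the synapses are updated accordingly.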
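
To see what a weight update could look like, here is a minimal sketch for the simplest possible case: a single neuron with no hidden layer and no activation function, trained with an assumed squared-error cost. The input values, the starting weights, the `learning_rate`, and the helper names are all made up for illustration. A real network repeats the same chain-rule idea layer by layer.

```python
def predict(inputs, weights):
    """Forward propagation for one linear neuron."""
    return sum(x * w for x, w in zip(inputs, weights))

def backprop_step(inputs, weights, y, learning_rate):
    """Return updated weights after one forward and one backward pass."""
    y_hat = predict(inputs, weights)
    error = y_hat - y  # derivative of C = 0.5 * (y_hat - y) ** 2 w.r.t. y_hat
    gradients = [error * x for x in inputs]  # chain rule: dC/dw_i = error * x_i
    return [w - learning_rate * g for w, g in zip(weights, gradients)]

# Hypothetical scaled study hours, sleep hours, and quiz score.
inputs = [0.6, 0.8, 0.7]
weights = [0.1, 0.1, 0.1]
y = 0.9  # hypothetical scaled exam score

new_weights = backprop_step(inputs, weights, y, learning_rate=0.1)
print(predict(inputs, weights), predict(inputs, new_weights))
```

After the update, the prediction sits closer to the actual value than before.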

### Epoch
An epoch is one full pass over the training data set, in which every row goes through forward and back propagation. The goal is to minimize the cost function, so we must perform multiple epochs to better learn the trends of the data set.

However, too many epochs may cause overfitting. This means that the model memorizes the training data instead of learning patterns that generalize to new data. To avoid overfitting, stop training early once the validation accuracy flattens out or starts decreasing.
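
The early stopping rule can be sketched as a small function over the validation accuracy recorded after each epoch. The function name, the `patience` parameter, and the accuracy history are all hypothetical.

```python
def early_stopping_epoch(val_accuracies, patience=2):
    """Return the 1-based epoch at which training would stop.

    Training stops once the validation accuracy has failed to beat its best
    value for `patience` epochs in a row.
    """
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, accuracy in enumerate(val_accuracies, start=1):
        if accuracy > best:
            best = accuracy
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_accuracies)  # never triggered: ran every epoch

# Hypothetical validation accuracy after each epoch.
history = [0.70, 0.78, 0.83, 0.85, 0.85, 0.84, 0.82]
print(early_stopping_epoch(history))  # 6
```

Keras ships an `EarlyStopping` callback built on the same idea, so in practice you would not need to write this loop yourself.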

# Gradient Descent
Now that we understand that back propagation sends a signal back to the neurons to update the synapse weights, we also need to understand how the weights are actually adjusted.

The goal is to minimize the cost function, so which weight values could accomplish that?

### Brute Force Approach
If we brute forced the weights by trying every combination, it would be inefficient: the number of combinations grows exponentially with the number of weights (the curse of dimensionality).

### Gradient Descent Approach
Graph the cost function with C as the dependent variable and y^ as the independent variable. The vertex of the curve is at y^ = y, where the prediction matches the actual value.

The goal is to reach the minimal cost, which is the vertex of the graph. To get closer to the minimal cost, compute the derivative (slope) of the cost at the current point to determine the direction in which to descend: go right if the slope is negative, left if it is positive.

<img src="images/ann/gradient_descent_1.png" height="30%" width="30%"></img>

Roll the cost ball to the right because it's a negative derivative.
<hr>

<img src="images/ann/gradient_descent_2.png" height="30%" width="30%"></img>

The cost ball rolled to the right. Now we need to roll the cost ball to the left because it's a positive derivative.
<hr>

<img src="images/ann/gradient_descent_3.png" height="30%" width="30%"></img>

The cost ball rolled to the left. Now we need to roll the cost ball to the right because it's a negative derivative.
<hr>

<img src="images/ann/gradient_descent_4.png" height="30%" width="30%"></img>

The cost ball is now at the minimal cost. We determined the best weights for the neural network!
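
The rolling ball can be sketched in code for the simplest case of a single weight. We assume the prediction is y^ = w * x and the cost is the squared error; the training row, the `learning_rate`, and the number of steps are made up for illustration.

```python
def cost_derivative(w, x, y):
    """Slope of C = 0.5 * (w * x - y) ** 2 with respect to the weight w."""
    return (w * x - y) * x

def gradient_descent(w, x, y, learning_rate=0.1, steps=50):
    for _ in range(steps):
        slope = cost_derivative(w, x, y)
        # Negative slope: w moves right. Positive slope: w moves left.
        w -= learning_rate * slope
    return w

# Hypothetical single training row; the ideal weight is 6.0 / 2.0 = 3.0.
x, y = 2.0, 6.0
w = gradient_descent(w=0.0, x=x, y=y)
print(w)  # very close to 3.0
```

The step size shrinks automatically as the slope flattens near the minimum, which is why the ball settles instead of bouncing forever.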

### Partial Derivatives of Gradient Descent
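What if the cost depends on many variables at once, such as multiple weights (or multiple outputs) in the neural network? How would gradient descent work? To handle this, the gradient descent algorithm calculates the partial derivative of the cost with respect to each individual weight and adjusts each weight according to its own partial derivative.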
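
Here is a minimal sketch of partial derivatives for a single neuron with three weights, again assuming a squared-error cost and made-up numbers. The second function estimates each partial derivative numerically by nudging one weight at a time, which is a handy way to check the formula.

```python
def cost(weights, inputs, y):
    """Squared-error cost for a single linear neuron (an assumed setup)."""
    y_hat = sum(w * x for w, x in zip(weights, inputs))
    return 0.5 * (y_hat - y) ** 2

def partial_derivatives(weights, inputs, y):
    """Analytic partial derivatives: dC/dw_i = (y_hat - y) * x_i."""
    y_hat = sum(w * x for w, x in zip(weights, inputs))
    return [(y_hat - y) * x for x in inputs]

def numerical_partials(weights, inputs, y, eps=1e-6):
    """Estimate each partial derivative by nudging one weight at a time."""
    partials = []
    for i in range(len(weights)):
        up = list(weights)
        down = list(weights)
        up[i] += eps
        down[i] -= eps
        partials.append((cost(up, inputs, y) - cost(down, inputs, y)) / (2 * eps))
    return partials

weights = [0.2, -0.4, 0.1]  # hypothetical weights
inputs = [0.6, 0.8, 0.7]    # hypothetical scaled inputs
y = 0.9

print(partial_derivatives(weights, inputs, y))
print(numerical_partials(weights, inputs, y))
```

The two lists should agree closely, and each entry tells gradient descent which way to move that one weight.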

# Stochastic Gradient Descent
In a convex function like the quadratic Cost function, there is a single minimum. However, if we used a different Cost function that had multiple local minima, then the Gradient Descent algorithm might descend to a local minimum instead of the global minimum.

<img src="images/ann/gradient_descent_problem.png" height="30%" width="30%"></img>

Notice how the cost ball is stuck at a local minimum instead of the global minimum. Stochastic Gradient Descent can help with this problem!

### Comparing The Two Gradient Descents
<img src="images/ann/batch_vs_stochastic_gradient_descent.png" height="75%" width="75%"></img>

For the standard Gradient Descent, the algorithm calculates the Cost function by summing the whole "batch" (all the rows) of the data set, then it applies the Gradient Descent on the weights. This is an example of Batch Learning!

For the Stochastic Gradient Descent, the algorithm calculates the Cost function for each row, then it applies Gradient Descent on the weights. Basically, it updates the weights one row at a time. This is an example of Online Learning (not to be confused with Reinforcement Learning, which is a different field)!
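
The difference between the two approaches can be sketched with a single weight. We assume the prediction is y^ = w * x with a squared-error cost; the rows, the learning rate `lr`, and the function names are made up. To keep the output reproducible, the rows are visited in order here, whereas in practice they are often shuffled.

```python
def batch_epoch(w, rows, lr):
    """One weight update per epoch, using the gradient summed over all rows."""
    total_gradient = sum((w * x - y) * x for x, y in rows)
    return w - lr * total_gradient

def stochastic_epoch(w, rows, lr):
    """One weight update per row."""
    for x, y in rows:
        w -= lr * (w * x - y) * x
    return w

# Hypothetical data set of (x, y) rows that follows y = 2x.
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

w_batch = w_sgd = 0.0
for _ in range(100):
    w_batch = batch_epoch(w_batch, rows, lr=0.05)
    w_sgd = stochastic_epoch(w_sgd, rows, lr=0.05)

print(w_batch, w_sgd)  # both end up very close to 2.0
```

On this simple convex problem both versions reach the same weight. The difference is that the stochastic version makes three updates per epoch while the batch version makes one.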

### Advantages of Stochastic Gradient Descent
1. More likely to escape a local minimum and reach the global minimum because its updates fluctuate more, although this is not guaranteed.  
2. Each update is lighter and faster because the algorithm does not have to load all the data into memory at once.

### Disadvantages of Stochastic Gradient Descent
1. The rows may be picked at random, so the neural network is updated in a stochastic (random) manner. Therefore, the number of epochs needed to minimize the Cost function also varies from run to run.

# Training Algorithm
Let's put all the mechanisms together to develop a training algorithm for the artificial neural network.

1. Randomly initialize weights to small numbers close to 0.  
2. Input the independent variables into the input layer.  
3. Perform forward propagation, then measure the Cost function.  
4. Perform back propagation, then use the Cost function to update the weights with online learning (stochastic) or batch learning.  
5. Repeat for more epochs, and stop early once the validation accuracy flattens out to prevent overfitting.
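
The five steps above can be sketched end to end for the smallest possible network: a single sigmoid neuron with no hidden layer, trained with stochastic updates and an assumed squared-error cost. The data set, the hyperparameters (`learning_rate`, `max_epochs`, `patience`), and all function names are made up for illustration. This is not how Keras implements training.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, row):
    return sigmoid(sum(w * x for w, x in zip(weights, row)))

def accuracy(weights, rows, labels):
    hits = sum((predict(weights, r) >= 0.5) == bool(y)
               for r, y in zip(rows, labels))
    return hits / len(rows)

def train(train_rows, train_labels, val_rows, val_labels,
          learning_rate=0.5, max_epochs=200, patience=5, seed=0):
    rng = random.Random(seed)
    # Step 1: randomly initialize the weights to small numbers close to 0.
    weights = [rng.uniform(-0.01, 0.01) for _ in train_rows[0]]
    best_acc, best_weights, stale = -1.0, list(weights), 0
    for _ in range(max_epochs):
        # Steps 2 to 4 (stochastic variant): one weight update per row.
        for row, y in zip(train_rows, train_labels):
            y_hat = predict(weights, row)  # forward propagation
            # Back propagation of C = 0.5 * (y_hat - y) ** 2 through the sigmoid.
            delta = (y_hat - y) * y_hat * (1.0 - y_hat)
            weights = [w - learning_rate * delta * x
                       for w, x in zip(weights, row)]
        # Step 5: stop early once validation accuracy stops improving.
        acc = accuracy(weights, val_rows, val_labels)
        if acc > best_acc:
            best_acc, best_weights, stale = acc, list(weights), 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_weights, best_acc

# Hypothetical rows of [bias input, scaled feature]; the label is 1 when the feature is positive.
train_rows = [[1.0, -2.0], [1.0, -1.0], [1.0, -0.5], [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]]
train_labels = [0, 0, 0, 1, 1, 1]
val_rows = [[1.0, -1.5], [1.0, -0.8], [1.0, 0.7], [1.0, 1.5]]
val_labels = [0, 0, 1, 1]

weights, val_acc = train(train_rows, train_labels, val_rows, val_labels)
print(weights, val_acc)
```

The constant 1.0 in each row plays the role of a bias input. In a real project, Keras takes care of the weight updates and the early stopping for you.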