# Backward Propagation in Densely Connected Feed-Forward Artificial Neural Networks

<center><img src="Images/nn_image.jpg"> </center>
    
[image by fdecomite](https://www.flickr.com/photos/fdecomite/3238821080)

## Review:

In the last lesson, you learned about the general structure of artificial neural networks (ANNs) and how data flows through the model, processed by each layer.

**Neural networks** are like collections of interconnected **neurons**, separated into **layers**. **Neurons** in each layer process data **simultaneously** in different ways, while the **layers** process data **sequentially**. Each layer's **outputs** become the next layer's **inputs**, culminating in the final layer's output as the **model's prediction**.

This lesson focuses on a specific type of ANN: the **densely connected feed-forward network**, also known as a multi-layer perceptron or just a dense neural network.

#### Vocabulary:
* **Artificial Neural Network (ANN)**: A collection of neurons arranged in layers, processing data to make predictions and learning from its prediction errors.
* **Neuron**: The basic processing unit of an ANN. Imagine it as a linear regression model with an activation function applied to its output. Each neuron has weights that multiply each input value and a bias that sums these products.
* **Parameters**: The **weights and biases** of a model.  These determine how data is processed toward a prediction and are updated, or tuned, during backward propagation to reduce model errors.
* **Layer**: A collection of neurons processing data simultaneously. Each layer passes the outputs of all its neurons to the next. In a densely connected network, each neuron in a layer receives output from every neuron in the previous layer.
* **Activation Function**: A function applied to a neuron's output to introduce non-linearity. This is crucial for the network to learn complex relationships between input features and targets.
* **Forward Propagation**: The process of a model ingesting data, processing it through each layer sequentially (weighted sums + bias), and finally making a prediction. This step also calculates the prediction error using a loss function.
* **Backward Propagation**: The process of adjusting weights and biases using gradient descent to minimize the model's prediction error.
* **Gradient Descent**: A technique for calculating the steepest downhill path in a multi-dimensional error surface, guiding the network's weight and bias adjustments to reduce prediction error.
* **Loss and Loss Function**: The loss of a model is its prediction error.  The loss function is the function that calculates that error, for example, mean squared error might be used for a regression model or binary cross-entropy might be used for a classification model.
* **Epoch**: One cycle of a forward propagation and backward propagation step.  A model may be trained for few or many epochs, but ultimately the number of epochs is determined by creator.

#### Real-World Applications: 
Dense neural networks are used in various domains to make regression and classification predictions, similar to more traditional machine learning models, but can often pick up much more complex signals in the data.  They are also often used as sub-components in more complex kinds of deep learning models to achieve tasks like image recognition, machine translation, and stock market prediction.

#### Next Steps: 

We'll dive deeper into backpropagation, exploring how it uses gradient descent to fine-tune the network's parameters and improve its performance.

## How Models Learn from Mistakes: **Gradient Descent**

Once a model has made a prediction, how does it learn from its errors and improve its performance? The key algorithm here is gradient descent.

### Key Concepts:

* **Loss Function:** A function to quantify the difference between predictions and true targets or labels to measure how well a model is performing.
* **Loss:** A metric to measure model error.  Lower loss should always mean less error.  This means not all metrics are appropriate for a loss function (such as accuracy or F1-score).
* **Loss Landscape:** The loss function creates a multidimensional landscape, where each point represents a different combination of parameter values and its associated loss.
* **Global Minimum:** The lowest point in the entire landscape, representing the model's optimal configuration.
* **Local Minima:** Valleys in the landscape that aren't the global minimum. The model might get stuck in these, preventing it from reaching its full potential.
* **Chain Rule:** In neural networks, the chain rule efficiently calculates gradients through multiple layers during backpropagation.

### Gradient Descent Explained

Imagine a ball rolling downhill, seeking the lowest point in a landscape. **Gradient descent** operates similarly, guiding a model's parameters (weights and biases) towards the lowest loss in its loss landscape.
The **loss function** represents the altitude in this landscape, and the parameters, determine the ball's position laterally.  This landscape is referred to as the **Loss Landscape**
**Gradient descent** calculates the gradient of the loss function, which points uphill. The model then takes a step in the opposite direction (downhill) to reduce loss.

In the animation below you can see how models with different parameter values, represented by different balls, will take different paths toward minimizing loss.  In fact, you can see that the one model at the back does not end up finding the **global minimum**, but gets stuck in a **local minimum** where no change change in weights equal to the **learning rate** step size will result in a reduction of loss, even though there does exist a better combination of parameter values.  Situations like this can be navigated or mitigated by adjusting the **learning rate**.

![Gradient Descient Gif](Images/Gradient_descent.gif)

Image by: [Jacopo Bertolotti](https://commons.wikimedia.org/wiki/User:Berto)



### Mathematical Details:

**Gradient descent update rule**: θ^(t+1) = θ^(t) - η * ∇_θL

* ∇_θL represents the gradient of the loss function with respect to a parameters (weight or bias).
* η is the learning rate (step size for each change).

## Backward Propagation

After calculating the gradient of the loss function with respect to the output layer, we use the **chain rule** to propagate the changes backward through the layers.

**Chain rule for backpropagation**: ∂L/∂θ_ij = ∂L/∂a_j * ∂a_j/∂z_j * ∂z_j/∂θ_ij (for a 3-layer network with layers A, B, and C)
* a_j is the output of neuron j in layer A.
* z_j is the weighted sum of inputs to neuron j.


### Conclusion

Gradient descent is fundamental to training various machine learning models, including neural networks. By understanding its mechanics, we gain deeper insights into how models learn and optimize their parameters to achieve better performance.

#### Next Steps

Next we will walk through a forward and backward step in a neural network using one sample from the Iris Dataset.  This will show you what a complete model epoch looks like.

![forward and backward propagation](Images/forward_and_back_propagation.png)

[Image Source](https://www.enjoyalgorithms.com/blog/forward-propagation-in-neural-networks)

# Completing an ANN model in Code Using NumPy

In this section we will create a binary classification neural network and test it on one sample.  In a real application we would create a model that can make predictions on, and learn from, many samples simulaneously.  However, for simplicity, we will start with just here.

To gauge our model's ability to learn, we want the model's predicted probability to be as close to the true class of a sample as possible.  We want to see the model's predicted probabilities to get closer to the true label with each epoch of learning.

## Batch Gradient Descent

In the previous lesson you saw how a model propagates a single training instance forward though it's layers.  This method can be used to perform a forward and backward propagation step for each sample in the training set.  This method is called **Stochastic Gradient Descent**.  It's best practice to sample the training set randomly, with replacement, for this approach.

However, in this lesson we will take a different approach: **Batch Gradient Descent**.  In **Batch Gradient Descent** The model trains on all training samples simultaneously.  This tends to be faster, and generates a smoother learning curve for the model, whereas **Stochastic Gradient Descent** can generate noisy, erratic learning.

The challenge with **Batch Gradient Descent** is managing the shapes of arrays.  There are many matrix multiplication operations and many different arrays to keep track of.  When creating this kind of model it's important to check the shape of your arrays after each step to ensure they are compatible.

However, for this lesson, don't get too caught up in that.  Pay attention to how information flows forward, derivatives are calculated and passed backward through layers of the model using the **Chain Rule** to calculate the gradients to update each weight and bias.  The most important takeaway here is the big picture.

## Steps:
1. Gather and prepare the data.
2. Initialize our neurons in the hidden and output layers.
3. Use a forward propagation pass to make a model prediction.
4. Calculate the loss using binary cross-entropy.
5. Calculate and combine the derivatives of the loss function and final activation functions.
6. Propagate the gradient backward from output layer to the hidden layer update the weights.
7. Evaluate the tuned model.

## Data

Once again we will use one of the flowers in the Iris dataset to demonstrate how a model would make a prediction on a single sample, then how it will apply **gradient descent** to backward propagation changes to neuron parameters to reduce the loss in the next epoch.

![Iris Image](Images/Iris.JPG)

## 1. Prepare Data

In [1]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
np.random.seed(42)

## Load the data
iris = load_iris()
X = iris.data
y = iris.target

print(X[:5])
print(y[:5])

[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
[0 0 0 0 0]


### Scale the Data

We will scale the data between 0 and 1 with min-max scaling.  We will subtract the minimum value of each feature from the values of that feature and then divide them by the maximum value.

While not required, neural networks have been shown experimentally to perform better with input data with absolute values less than 1.

In [2]:
# Define a min/max scaling function
def min_max_scale(features):
    min = np.min(features)
    max = np.max(features)
    features = (features - min) / max
    return features

# Scale the features
X_sc = min_max_scale(X)
X_sc[:5]

array([[0.63291139, 0.43037975, 0.16455696, 0.01265823],
       [0.60759494, 0.36708861, 0.16455696, 0.01265823],
       [0.58227848, 0.39240506, 0.15189873, 0.01265823],
       [0.56962025, 0.37974684, 0.17721519, 0.01265823],
       [0.62025316, 0.44303797, 0.16455696, 0.01265823]])

We will create a simple network with two neurons that each take as input the 4 features of this dataset, multiplies them by a coefficient, sums them together with a bias, applies a sigmoid activation, sums the results together, and outputs a model prediction.  Since the model will not yet have been trained, the prediction will be essentially random.

## 2. Initialize the Hidden and Output Layers

In this step we will define a function that will create two layers for us, a hidden layer and an output layer.  Our function take the number of neurons and the activation function as arguments and output the weights and biases of the layer.

We will combine the code from the previous lesson into a few functions for convenience.

1. `initialize_neuron()` to create some randomized weights and biases for us.
2. `intialize_layer()` To combine the neurons into layers.  The resulting layers will be of the form `[*nodes, activation]` where `*nodes` are all of the nodes for that layer and the activation function is the last element of the list.
3. Some activation functions, which also include derivative variations to help with backward propagation.

#### Activation Function

In order to allow our model to learn non-linear functions, we will apply an activation function to each neuron's output. In this case, we will use a sigmoid activation function that confines the result to between 0 and 1.  Later, you will learn about other activation functions, but this is the one we will use today.

We are also going to create a linear activation function, which is the identity function, for completeness.  However, we will not use it in this demonstration.

### Initialization Functions

We will create helper functions, `initialize_neuron()` and `sigmoid()`, `linear()`, and `initialize_layer()` to help us build each layer of the model.

In [3]:
## Function to initialize a neuron
def initialize_neuron(n_weights, zero_bias=True):
    """Returns a tuple of (weights, bias)
    The weights will be a numpy array of size n_weights 
    with each value a random number taken from a normal distribution
    If zero_bias == True, the bias will be zero
    Otherwise it will be taken from the same distribution as the weights"""
    weights = np.random.randn(1,n_weights)
    if zero_bias:
        bias = 0
    else:
        bias = np.random.randn(1)
    return weights, bias

## Sigmoid activation function
def sigmoid(z, derivative=False):
  """Apply a sigmoid function"""
  if derivative:
      return sigmoid(z) * (1-sigmoid(z))
  else:
      return 1 / (1 + np.exp(-z))

## Default activation function
def linear(z, derivative=False):
    """linear activation, equivalent to the identify function"""
    if derivative:
        return 1
    else:
        return z

## Function to initialize a layer.
def initialize_layer(n_neurons, n_weights, zero_bias=True, activation=linear):
    """Returns a list of neurons and an activation function.  Each neuron will be a tuple of (weights, bias)
    If an activation function is passed it will be applied, otherwise an idendity function will be applied."""
    layer = [initialize_neuron(n_weights=n_weights, zero_bias=zero_bias) for _ in range(n_neurons)]
    layer.append(activation)
    return layer

### Input Layer

The features themselves make up the input layer.  No weights or biases are necessary and we will use the data itself for this layer.  However, it is important to keep in mind that that 4 values (features) will be passed to the hidden layer.  This determines the number of weights for each node in that layer.

### Hidden Layer

We will create a hidden layer with 2 neurons and 4 weights per neuron.  In a densely connected model each neuron should have a weight for every incoming input.  Since the input layer is sending in 4 values (the features of the data), each neuron needs 4 weights.

In [4]:
## Initialize the hidden layer
hidden_layer = initialize_layer(n_neurons=2, n_weights=X.shape[1], zero_bias=True, activation=sigmoid)
hidden_layer

[(array([[ 0.49671415, -0.1382643 ,  0.64768854,  1.52302986]]), 0),
 (array([[-0.23415337, -0.23413696,  1.57921282,  0.76743473]]), 0),
 <function __main__.sigmoid(z, derivative=False)>]

### Output Layer

Our output layer will be one neuron because we want the model to make a single value prediction for each sample.  In other situations we might want an output layer with multiple neurons if we wanted the model to output multiple values, such as with multiclass classification or some applications of generative AI.

Since the hidden layer has 2 neurons and each neuron will send one value, the neuron in the output layer should contain 2 weights, one for each incoming value.

In [5]:
## Initialize the output layer
output_layer = initialize_layer(n_neurons=1, n_weights=len(hidden_layer)-1, zero_bias=True, activation=sigmoid)
output_layer

[(array([[-0.46947439,  0.54256004]]), 0),
 <function __main__.sigmoid(z, derivative=False)>]

### Model

Let's put the layers together into a model, which will be a list of the layers and activations

In [6]:
model = [hidden_layer, output_layer]
model

[[(array([[ 0.49671415, -0.1382643 ,  0.64768854,  1.52302986]]), 0),
  (array([[-0.23415337, -0.23413696,  1.57921282,  0.76743473]]), 0),
  <function __main__.sigmoid(z, derivative=False)>],
 [(array([[-0.46947439,  0.54256004]]), 0),
  <function __main__.sigmoid(z, derivative=False)>]]

![4-2-1 network](Images/4-2-1%20network.png)

## 3. Forward Propagation

In this step we will use the layers we created to process the data from the input features.  First the weights and biases of the hidden layer neurons will be applied, then the resulting outputs of those neurons will be passed to the output layer.  Finally, a model prediction will be created.  This prediction will the model's estimated probability of the sample belonging to class 1.

To help us with this step, we will define a forward pass function that will apply each layer and it's activation function in sequence to produce a model output.  

The forward pass function will also store the outputs of each neurons both before and after the activation functions in lists.  These will be used later during backward propagation.

Traditionally, the outputs of neurons before before activation are labeled as 'z' and after activations are labeled 'a'.  We will use these conventions to build our dictionary.

Let's walk through this:

Our model is represented as a list of layers.  Each layer is a list of neurons and an activation function.  Each neuron is a tuple of an array of weights and a bias.

### Functions

`forward_neuron()` will apply the weights and bias to the neuron inputs.

`forward_layer()` will apply the neurons of each layer to the input and store the results.  Then it will apply the activation function to the outputs of each neuron and store the results of that.  It returns a tuple of activations As (after activation function) and pre-activation function neuron outputs, Zs.

`forward_model()` will apply the layers of the model to the inputs in order, with the activations of one layer being the inputs of the next layer.  It will store the activations and the neuron inputs of each neuron of each layer in a dictionary.  During backward propagation, we will work backward through those lists to change the weights of each neuron in each layer of the model.  This will include the initial inputs as well, as we will need those to update the first layer, and these are the outputs of the input layer.

The final activation of the list of activations in the dictionary is the model prediction, since it represents the activation of the neuron(s) in the final layer.

In [7]:
## Define a function for a neuron.
def forward_neuron(neuron, input):
    """Outputs the sum of the bias and the dot product of the weights and inputs.
    This is equivalent to multiplying the weights by the inputs and summing the result."""
    weights, bias = neuron
    z = np.dot(weights, input.T) + bias
    return z

## Define function for a layer.
def forward_layer(layer, input):
    """Performs a forward pass with each neuron in a layer and applies the activation function to each result.
    Returns a tuple of activations (Ax) and pre-activation neuron outputs (Zs)"""
    activation = layer[-1]
    neurons = layer[:-1]
    Zs = np.zeros((len(input),len(neurons)))  # Pre-allocate Zs with expected size
    As = np.zeros((len(input),len(neurons)))  # Pre-allocate As with expected size
    for i, neuron in enumerate(neurons):
        z = forward_neuron(neuron, input)
        a = activation(z)
        Zs[:,i] = z  # Assign values using indexing
        As[:,i] = a
    return As, Zs

## Define function for model
def forward_model(layers, input):
    """Performs forward passes for each layer in the model.  
    Returns an ordered list of layer activations and layer neuron outputs
    for each layer
    Note that the final activations of the list of layer activations will be the model predictions."""
    outputs = {'As': [input], 'Zs': [input]}
    for i, layer in enumerate(layers):
        As, Zs = forward_layer(layer, input)
        outputs['As'].append(As)  # Append lists directly
        outputs['Zs'].append(Zs)
        input = Zs
    return outputs

### Predictions

Let's take a look at our model predictions.

In [8]:
## Examine the outputs of the model.
outputs = forward_model(model, X_sc)
predictions = outputs['As'][-1]
predictions[:5]

array([[0.45820865],
       [0.4614505 ],
       [0.4615873 ],
       [0.46639715],
       [0.45914561]])

Note that the model predictions are all fairly similar and should not have any very meaningful pattern.  

## 4. Calculate Loss: Binary Cross-entropy

Since this is a binary classification model, we will use binary cross-entropy to calculate the loss.  Remember, we want as low a loss as possible, so our goal will be to decrease this loss in the next epoch.  This will not always immediate directly correlate to improved accuracy, precision, or recall, but as the loss decreases, those metrics will generally tend to improve.


Here's the formula for binary cross-entropy:

BCE = -(y * log(p) + (1 - y) * log(1 - p))

**where**:

* BCE: Binary Cross-Entropy loss

* y: True label (either 0 or 1)

* p: Predicted probability (a value between 0 and 1)

* log: Logarithm function (usually base 2 for information theory, but base e is common in machine learning)

**Explanation:**

* Measures the difference between the true distribution (y) and the predicted distribution (p).
* Assigns a higher penalty for more confident but incorrect predictions.
* Aims to minimize the loss during training, leading to better predictions.

**Key Points**:

Commonly used for binary classification problems.
Often used with the sigmoid activation function in the final layer of a neural network.
Can be extended to multi-class classification using categorical cross-entropy.

In [9]:
## Binary cross-entropy in NumPy
def binary_crossentropy(y_true, y_pred):
    y_true = y_true.reshape(-1,1)
    """Calculates the binary cross-entropy loss."""
    y_pred = np.clip(y_pred, 1e-7, 1 - 1e-7)  # Prevent numerical instability
    loss = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return loss

loss = binary_crossentropy(y, predictions)
loss.mean()

0.6796196578972339

We will see if we can reduce this loss, which means that the predicted probabilies are closer to the actual class labels.

# Backward Propagation

In the following steps we will calculate the derivative of the function from model to loss.  In order to do this we will use the chain rule to propagation derivatives backward from the end (loss) to the beginning (hidden layer parameters) and adjust the weights and biases of each layer using gradient descent.

## 5. Calculate Derivatives of Loss and Sigmoid Activation

We will start with functions to calculate the derivative of the loss and sigmoid activations and combine them into one final loss derivative to perform gradient descent on the output layer.

These will calculate derivatives for each of the model predictions and true labels to pass back to the model.  The model will update based on the average of the gradients.

In [10]:
## Binary cross-entropy derivative function
def binary_crossentropy_activation_derivative(y_true, z, activation):
    """Calculates the derivative of the binary cross-entropy loss and final activation."""
    ## Reshape for consistent array sizes
    y_true = y_true.reshape(-1,1)

    ## Apply activation function to the final node output
    y_pred = activation(z)

    y_pred = np.clip(y_pred, 1e-7, 1 - 1e-7)  # Prevent numerical instability

    ## Calculate the derivative of the BCE loss function
    bce_deriv = -(y_true / y_pred) + (1 - y_true) / (1 - y_pred) # BCE deriviative

    ## Calculate the derivative of the activation function
    activatation_deriv = activation(z, derivative=True) # Activation function derivative

    ## Combine the BCE and activation derivatives using the chain rule

    final_deriv = bce_deriv * activatation_deriv # Chain rule

    ## Ensure proper shape of derivatives
    if len(final_deriv.shape) < 2:
        final_deriv = final_deriv[np.newaxis,:]
    return final_deriv

## 6. Update the weights and bias of the layers

Now comes the model learning!  We start at the end of the model, the output layer, and propagate the changes in parameters upward, calculating the new gradient between each layer.

**Learning Rate**

We generally want our models to slowly converge toward the global minimum so they don't go too far!  For this reason we define a learning rate to slow down the learning and allow the model to correctly traverse the loss landscape.  This is a parameter that needs careful tuning, but it's generally a small number that is multipled times the gradient.

### Functions:

`update_node()` Our first function updates the weights and biases of the nodes.  We first calculate the gradient for the weights by multiplying the node output gradient with the node's inputs and take the mean of the gradient for each input.  This gives us the gradient of the function represented by that node, f(inputs) by the chain rule.

`calculate_previous_layer_gradient()` We also need to calculate the derivative of each node in this layer to pass back to the previous layer.  We do this by using, `np.dot()`, on the weights and current layer derivatives.  

`update_layer_and_calculate_gradient()` The final function combines the first two and applies them to each 

In [11]:
def update_node(weights, bias, learning_rate, next_layer_gradient, node_inputs):
    """Updates the weights and bias of a node in a neural network using gradient descent.

    Args:
        weights: A NumPy array of the node's weights.
        bias: A NumPy scalar representing the node's bias.
        learning_rate: The learning rate for gradient descent.
        next_layer_gradient: A NumPy array of the gradient received from the subsequent layer.
        node_inputs: A NumPy array of the inputs to the node.

    Returns:
        A tuple containing the updated weights and bias.
    """

    # Calculate gradient for weights
    weight_gradient = next_layer_gradient.reshape(-1,1) * node_inputs
    weight_gradient = weight_gradient.mean(0)
    # Update weights
    updated_weights = weights - learning_rate * weight_gradient

    # Calculate gradient for bias
    bias_gradient = next_layer_gradient.mean()  # Average gradient across features

    # Update bias
    updated_bias = bias - learning_rate * bias_gradient

    return updated_weights, updated_bias

def calculate_previous_layer_gradient(weights, current_gradient, activation_function=None):
    """Calculates the gradient for the previous layer in a neural network.

    Args:
        weights: A NumPy array of the weights connecting the current layer to the previous layer.
        current_gradient: A NumPy array of the gradient for the current layer.
        activation_function: The activation function used in the previous layer (optional).

    Returns:
        A NumPy array of the gradient for the previous layer.
    """

    # Calculate gradient before activation (if applicable)
    current_gradient = current_gradient.reshape(-1,1)
    pre_activation_gradient = np.dot(current_gradient, weights.reshape(1,-1)) ## Matrix multiplication

    # Apply activation function derivative (if applicable)
    if activation_function is not None:
        previous_layer_gradient = activation_function(pre_activation_gradient, derivative=True)
    else:
        previous_layer_gradient = pre_activation_gradient
    return previous_layer_gradient

def update_layer_and_calculate_gradient(layer, layer_inputs, next_layer_gradient, learning_rate):
    """Updates a layer's weights and biases using gradient descent and calculates the gradient for the previous layer.

    Args:
        layer: A list of nodes, where each node is [weights, bias]. The activation function is appended to the end of the list.
        layer_inputs: A NumPy array of the inputs to the layer.
        next_layer_gradient: A NumPy array of the gradient received from the subsequent layer.
        learning_rate: The learning rate for gradient descent.

    Returns:
        A tuple containing the updated layer and the gradient for the previous layer.
    """

    updated_nodes = []
    previous_layer_gradients = np.zeros((layer_inputs.shape[0], len(layer)))

    # Iterate through nodes in the layer
    for node_index, node in enumerate(layer[:-1]):  # Exclude the activation function
        print(f'Updating node {node_index}')
        weights, bias = node
        # Update node weights and bias
        updated_weights, updated_bias = update_node(
            weights, bias, learning_rate, next_layer_gradient[:, node_index], layer_inputs
        )
        # Append updated node to the new layer
        updated_nodes.append((updated_weights, updated_bias))

        # Calculate gradient for the previous layer (if not the first layer)
        previous_layer_gradients[:, node_index] = calculate_previous_layer_gradient(weights, 
                                                                    next_layer_gradient[:, node_index], 
                                                                    activation_function=layer[-1])[:,0]


    # Append activation function to the updated layer
    updated_nodes.append(layer[-1])

    return updated_nodes, previous_layer_gradients


## `backward_model()`
Our final backward propagation function will start with the final loss and activation derivatives and propagate the gradients backward through the layers of the model from end to beginning.  For this purpose, the function will create a reversed version of the model to iterate through (rather than iterating in reverse).  

![forward and backward propagation](Images/forward_and_back_propagation.png)

In [12]:
def backward_model(model, y_true, outputs, learning_rate):
    """Perform backward propagation on a model.
    1. Calculate gradients for the output layer
    2. Apply gradient descent to layers using backpropagation using the chain rule
    Args:
    model: The model to perform backpropagation on
    y_true: The true labels
    outputs: the dictionary of outputs of each layer from the forward_model() function
    learning_rate: The learning rate to apply to each gradient during gradient descent to control step size.  Recommend between .1 and .0001
    
    Returns: New model with backward propagation applied"""

    
    final_z = outputs['Zs'][-1]
    final_activation = model[-1][-1]
    gradient = binary_crossentropy_activation_derivative(y_true,
                                                        final_z,
                                                        final_activation)
    
    ## Reverse the models and the
    new_model = []
    reversed_model = model.copy()
    reversed_model.reverse()
    outputs['As'].reverse()
    outputs['Zs'].reverse()
    for i, layer in enumerate(reversed_model):
        print(f'Updating layer {i}')
        layer_inputs = outputs['As'][i+1]
        new_layer, gradient = update_layer_and_calculate_gradient(layer,
                                                                  layer_inputs=layer_inputs,
                                                                  next_layer_gradient=gradient,
                                                                  learning_rate=learning_rate)
        
        new_model.append(new_layer)
    
    new_model.reverse()
    return new_model

## Evaluate the Tuned Model

Now that we've trained our model, has our model successfully learned anything?  Let's evaluate it using our loss function to find out.  We will first evaluate our untrained model for a baseline, then compare it to the newly trained model.

In [13]:
## Evaluate the model with no training using binary cross-entropy
outputs1 = forward_model(model, X_sc)
pred1 = outputs1['As'][-1]
print('BCE of model with no training')
binary_crossentropy(y, pred1).mean()

BCE of model with no training


0.6796196578972339

In [14]:
## Evaluate the model after one epoch of learning

## Set learning rate (should be tuned)
learning_rate=.1

## Perform backward propagation on the model
model2 = backward_model(model, y, outputs, learning_rate)

Updating layer 0
Updating node 0
Updating layer 1
Updating node 0
Updating node 1


In [15]:
## Evaluate the new model
outputs2 = forward_model(model2, X_sc)
pred2 = outputs2['As'][-1]
print('BCE of model after 1 epoch of training')
binary_crossentropy(y, pred2).mean()

BCE of model after 1 epoch of training


0.5910228563325499

The model learned to more accurately classify iris flowers using various measurements of their petals and sepals!

![Brain](Images/256px-Human_Brain.png)

# Conclusion

We've built a small neural network capable of forward and backward propagation.  It iteratively learns to better predict the class labels by reducing the loss after each epoch of training.  It uses gradient descent to change the weights of each node in each layer according to the derivative of the loss function by applying the chain rule to propagate the calculate derivative backward to each layer.

Repeated epochs of forward and backward propagation should continue to reduce the loss...to a point.  Real world datasets nearly always have some irreducible noise.


## Challenges:

1. Create validation data and check for overfitting.  Save the loss at reach epoch and plot them.  Is there a number of epochs where the model starts to overfit?
2. Create a loop to train a model for n number of epochs
3. Convert this functional approach to an object-oriented approach.  A model should have forward and backward propagation methods, a training method that combines them, and keep interal variables of the outputs of each layer during a forward step for use during backward propagation, and the loss after each epoch for plotting after training.
4. Implement mini-batching by breaking the dataset into smaller batchs and training on each one in sequence, perhaps try batches of 10 or so.  This is a combination of **Stochastic** and **Batch Gradient Descent**