# Backpropagation in AI

## ELI5: Backpropagation

Imagine you're trying to teach a robot to throw a ball into a hoop. Every time the robot misses, you slightly adjust the robot's arm and the force it uses. Over time, the robot gets better at throwing the ball into the hoop. Backpropagation in AI is similar to this. It's a way for the computer to learn from its mistakes and adjust itself to get better results.

1. Start with random weights (like the robot's initial throw).
2. See how far off the prediction is (the difference between the throw and the hoop).
3. Adjust the weights a little bit to get closer to the correct answer.
4. Repeat until the computer (or robot) is good at the task.

## A More Advanced Explanation

Backpropagation is the algorithm used to train artificial neural networks in supervised learning. It computes the gradient of the loss function with respect to each weight by applying the chain rule, layer by layer, from the output back to the input.

### Steps of Backpropagation:

1. **Forward Pass**:
    * Input a training sample.
    * Pass it through the network to get the prediction.
    * Calculate the error (difference between predicted and actual value).

2. **Backward Pass**:
    * Calculate the gradient of the error with respect to each weight. This tells us how much the error would change if we changed that weight by a tiny amount.
    * Update the weights in the network using a learning rate.

### Important Formulas:

* The error term for each output neuron (assuming a squared-error loss) is calculated as:
    $$ \delta_j = (a_j - y_j) \cdot f'(h_j) $$

    where $ \delta_j $ is the error term for neuron $ j $, $ a_j $ is the predicted output, $ y_j $ is the actual (target) output, $ h_j $ is the weighted input to neuron $ j $, and $ f'(h_j) $ is the derivative of the activation function at $ h_j $.

* The weights are updated using the formula:
    $$ w_{ij} \leftarrow w_{ij} - \mu \cdot \delta_j \cdot x_i $$

    where $ w_{ij} $ is the weight from neuron $ i $ to neuron $ j $, $ \mu $ is the learning rate, $ \delta_j $ is the error term for neuron $ j $, and $ x_i $ is the output of neuron $ i $. Because $ \delta_j \cdot x_i $ is the gradient of the error with respect to $ w_{ij} $, subtracting it moves the weight in the direction that reduces the error.

In simpler terms, backpropagation helps the network learn from its mistakes by adjusting the weights in the direction that reduces the error. It's like tweaking the knobs and dials of a complex machine until it works just right.
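
The two formulas above can be tried out on a single sigmoid neuron. The sketch below is only an illustration: the input values, starting weights, target, and learning rate are made up, and it assumes a squared-error loss with the error term taken as $ (a_j - y_j) \cdot f'(h_j) $, so that subtracting $ \mu \cdot \delta_j \cdot x_i $ lowers the error.

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

# Hypothetical values, chosen only for illustration.
x = np.array([0.5, -1.0, 2.0])   # outputs x_i of the upstream neurons
w = np.array([0.1, 0.2, -0.3])   # weights w_ij feeding neuron j
y = 1.0                          # target output y_j
mu = 0.5                         # learning rate

def update(w):
    """Apply one delta-rule update; return the new weights and the loss before the update."""
    h = np.dot(w, x)                  # weighted input h_j
    a = sigmoid(h)                    # predicted output a_j
    delta = (a - y) * a * (1 - a)     # (a_j - y_j) * f'(h_j); sigmoid' = a * (1 - a)
    loss = 0.5 * (a - y) ** 2         # squared-error loss (an assumption of this sketch)
    return w - mu * delta * x, loss

w_new, loss_before = update(w)
_, loss_after = update(w_new)
print("loss before:", loss_before, "loss after one update:", loss_after)
```

With these values the prediction starts below the target, so the update pushes the weighted input upward and the loss shrinks after one step.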


# Multilayer Perceptron (MLP)

A Multilayer Perceptron (MLP) is a class of feedforward artificial neural network. It consists of at least three layers of nodes: an input layer, one or more hidden layers, and an output layer. Each node (or neuron) in a layer is connected to every node in the subsequent layer, with each connection having an associated weight.

## Architecture:

1. **Input Layer**: 
    * The initial layer where data is fed into the network.
    * It has as many nodes as there are input features.

2. **Hidden Layers**: 
    * These layers are between the input and output layers.
    * Each hidden neuron computes a weighted sum of its inputs and applies an activation function, which lets the network learn intermediate patterns in the data.

3. **Output Layer**: 
    * The final layer that produces the prediction or classification result.
    * It typically has one neuron per class for multi-class classification, and a single neuron for binary classification or scalar regression.
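
To make the layer structure concrete, here is a small sketch that counts the weights and biases of a fully connected network. The helper name `count_parameters` and the layer sizes (3 inputs, 4 hidden neurons, 1 output) are assumptions chosen for illustration.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the number of nodes per layer, input layer first.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        n_weights = n_in * n_out   # one weight per connection between the two layers
        n_biases = n_out           # one bias per neuron in the receiving layer
        print(f"{n_in} -> {n_out}: {n_weights} weights, {n_biases} biases")
        total += n_weights + n_biases
    return total

total_params = count_parameters([3, 4, 1])
print("total trainable parameters:", total_params)
```

Every node connects to every node in the next layer, so the weight count between two layers is simply the product of their sizes.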

## Activation Functions:
Neurons typically have an activation function that transforms their weighted input into an output signal. Common activation functions include:
* Sigmoid
* ReLU (Rectified Linear Unit)
* Tanh (Hyperbolic Tangent)
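
As a quick illustration, the three functions listed above can be written in a few lines of NumPy. This is a minimal sketch; the sample inputs in `z` are arbitrary.

```python
import numpy as np

def sigmoid(x):
    # Squashes any real value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # Passes positive values through unchanged and clips negatives to 0
    return np.maximum(0.0, x)

def tanh(x):
    # Squashes any real value into the range (-1, 1)
    return np.tanh(x)

z = np.array([-2.0, 0.0, 2.0])   # arbitrary weighted inputs
print("sigmoid:", sigmoid(z))
print("relu:   ", relu(z))
print("tanh:   ", tanh(z))
```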

## Basic Python Illustration:

Let's consider a simple MLP with:
* 3 input features.
* 1 hidden layer with 4 neurons.
* 1 output neuron (for a binary classification task).


In [1]:
import numpy as np

# Activation functions and their derivatives

def sigmoid(x):
    """Sigmoid Activation Function:
    
    This function squashes values between 0 and 1. It's widely used for 
    outputs of binary classification problems.
    
    Args:
    - x (float): Input value or array.
    
    Returns:
    - float: Transformed value between 0 and 1.
    """
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    """Derivative of the Sigmoid Function:
    
    This function returns the gradient of the sigmoid function at a given point.
    It's used during backpropagation to adjust weights based on the error.
    
    Args:
    - x (float): Input value or array.
    
    Returns:
    - float: Gradient of the sigmoid function at 'x'.
    """
    s = sigmoid(x)
    return s * (1 - s)

# Sample MLP

class SimpleMLP:
    def __init__(self, input_size, hidden_size, output_size):
        """Initializer for the SimpleMLP class.
        
        Initializes weights and biases for the network. Weights are 
        initialized with random values, which will be adjusted during 
        training.
        
        Args:
        - input_size (int): Number of nodes in the input layer.
        - hidden_size (int): Number of nodes in the hidden layer.
        - output_size (int): Number of nodes in the output layer.
        """
        
        # Initialize weights and biases for input to hidden layer connections
        # Weights are matrices where each entry i,j is the weight from the i-th input node to the j-th hidden node.

        self.weights_input_to_hidden = np.random.randn(input_size, hidden_size)

        # Biases shift each neuron's weighted sum, so a neuron can activate even when all inputs are zero.
        # (The non-linearity itself comes from the activation function, not from the bias.)

        self.bias_hidden = np.random.randn(hidden_size)
        
        # Initialize weights and biases for hidden to output layer connections
        self.weights_hidden_to_output = np.random.randn(hidden_size, output_size)
        self.bias_output = np.random.randn(output_size)
    
    def forward(self, x):
        """Forward Pass Through the Network.
        
        Takes an input 'x' and passes it through the network to produce an output.
        
        Args:
        - x (array): Input data.
        
        Returns:
        - array: Output from the network.
        """
        
        # Input to Hidden Layer
        # Calculate the dot product of the input data and weights, then add the bias

        self.hidden_input = np.dot(x, self.weights_input_to_hidden) + self.bias_hidden

        # Apply the sigmoid activation function to introduce non-linearity

        self.hidden_output = sigmoid(self.hidden_input)
        
        # Hidden to Output Layer
        # Calculate the dot product of the hidden layer output and weights, then add the bias

        self.output_input = np.dot(self.hidden_output, self.weights_hidden_to_output) + self.bias_output

        # Apply the sigmoid activation function to get the final output
        
        self.final_output = sigmoid(self.output_input)
        
        return self.final_output

# Create an instance of the MLP and test with a sample input
mlp = SimpleMLP(input_size=3, hidden_size=4, output_size=1)
sample_input = np.array([0.5, 0.6, 0.7])
output = mlp.forward(sample_input)
print(output)


[0.16324619]


# Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent is a popular optimization algorithm used to minimize (or maximize) a function iteratively. It's especially useful for training large-scale machine learning models. Let's break down the name into its components to understand its meaning and working:

## 1. Gradient Descent:

Gradient Descent is an optimization algorithm used to minimize (or maximize) functions. The basic idea behind gradient descent is to:

1. Calculate the gradient (or slope) of the function at a given point.
2. Move in the opposite direction of the gradient (for minimization) by a certain step size or "learning rate."
3. Repeat the process until the function reaches a minimum (or maximum) value.

### Key Concepts:

* **Gradient**: It is the vector of partial derivatives. In the context of machine learning, the function we want to minimize is usually the loss or error function, and the gradient gives the direction of steepest increase of this function.
* **Learning Rate**: This is a hyperparameter that determines the size of the steps we take in the direction opposite to the gradient. A high learning rate can lead to overshooting the minimum, while a low learning rate can make the convergence very slow.
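
The effect of the learning rate is easy to see on a toy problem. The sketch below assumes the simple quadratic loss $ L(w) = (w - 3)^2 $, whose gradient is $ 2(w - 3) $; the function name `gradient_descent` and the three learning rates are illustrative choices.

```python
def gradient_descent(w0, learning_rate, steps):
    """Run plain gradient descent on L(w) = (w - 3)^2, starting from w0."""
    w = w0
    for _ in range(steps):
        grad = 2 * (w - 3)               # dL/dw at the current w
        w = w - learning_rate * grad     # step against the gradient
    return w

for lr in (0.001, 0.1, 1.1):
    w_final = gradient_descent(w0=0.0, learning_rate=lr, steps=50)
    print(f"learning rate {lr}: w after 50 steps = {w_final}")
```

For this loss, each step multiplies the distance to the minimum by $ (1 - 2 \cdot \text{learning rate}) $. A rate of 0.1 shrinks the distance by a factor of 0.8 per step, a rate of 0.001 shrinks it by only 0.998 per step (very slow), and a rate of 1.1 gives a factor of -1.2, so the iterate overshoots and moves further away each time.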

## 2. Stochastic:

In classical (or "batch") gradient descent, the gradient is computed using the entire dataset, which can be computationally expensive and time-consuming for large datasets. "Stochastic" in SGD refers to the use of a single random data point (or a small subset called a mini-batch) from the dataset to compute the gradient at each step instead of the entire dataset.

### Key Concepts:

* **Stochasticity**: Introduces randomness into the algorithm. The noisy updates can help the optimizer escape shallow local minima, and because each step is cheap, SGD often makes progress faster than batch gradient descent. The price is higher variance: the weights tend to jump around the optimum instead of settling exactly on it.
* **Mini-batch SGD**: A compromise between batch and pure stochastic gradient descent. Instead of using the entire dataset or a single example, a mini-batch of samples is used to compute the gradient. This can offer a balance between computational efficiency and convergence stability.
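
The difference between single-sample, mini-batch, and full-batch updates can be sketched on a one-parameter problem. The example below assumes a per-sample loss of $ (w - x)^2 $, so the best $ w $ is the mean of the data; the function name `minibatch_sgd`, the dataset, and the hyperparameters are all illustrative.

```python
import numpy as np

def minibatch_sgd(data, batch_size, learning_rate, epochs, seed=0):
    """Minimize the average of (w - x)^2 over data using mini-batches."""
    rng = np.random.default_rng(seed)
    w = 0.0
    for _ in range(epochs):
        order = rng.permutation(len(data))           # reshuffle every epoch
        for start in range(0, len(data), batch_size):
            batch = data[order[start:start + batch_size]]
            grad = np.mean(2 * (w - batch))          # average gradient over the batch
            w -= learning_rate * grad
    return w

data = np.linspace(0, 6, 100)                        # mean of the data is 3
w_single = minibatch_sgd(data, batch_size=1, learning_rate=0.1, epochs=10)
w_mini = minibatch_sgd(data, batch_size=10, learning_rate=0.1, epochs=10)
w_full = minibatch_sgd(data, batch_size=len(data), learning_rate=0.1, epochs=10)
print("batch size 1:  ", w_single)
print("batch size 10: ", w_mini)
print("full batch:    ", w_full)
```

With a batch size of 1 there are many cheap but noisy updates; with the full batch there is one smooth update per epoch, so ten epochs give only ten steps. A mini-batch sits in between.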

## Summary:

Stochastic Gradient Descent is an optimization technique where the model parameters are updated using the gradient of the error with respect to a single (or a mini-batch of) training example(s) rather than the entire training dataset. This approach can be much faster and can also navigate out of local minima, but might be noisier than the traditional gradient descent.


In [2]:
import numpy as np

# Define a simple quadratic loss function: L(w) = (w - 3)^2
# The minimum value of this function is 0, achieved when w = 3.
def loss_function(w):
    """Quadratic Loss Function
    
    This function calculates the loss for given weight 'w'.
    It's a simple quadratic function centered around 3.
    
    Args:
    - w (float): Current weight value.
    
    Returns:
    - float: Loss value.
    """
    return (w - 3) ** 2

def gradient(w):
    """Gradient of the Quadratic Loss Function
    
    This function calculates the gradient of the loss with respect to 'w'.
    For the function L(w) = (w - 3)^2, its gradient is dL/dw = 2(w - 3).
    
    Args:
    - w (float): Current weight value.
    
    Returns:
    - float: Gradient value.
    """
    return 2 * (w - 3)

def stochastic_gradient_descent(epochs, learning_rate, data):
    """Stochastic Gradient Descent Optimization
    
    This function attempts to find the weight 'w' that minimizes the loss function 
    using stochastic gradient descent.
    
    Args:
    - epochs (int): Number of passes through the dataset.
    - learning_rate (float): Step size for each weight update.
    - data (array-like): Scalar samples, visited one at a time in shuffled order.
    
    Returns:
    - list: History of weight values throughout the training.
    """
    # Initialize weight randomly
    w = np.random.randn()
    weight_history = [w]
    
    for epoch in range(epochs):
        # Shuffle data to ensure randomness in picking samples
        np.random.shuffle(data)
        for sample in data:
            # Calculate the gradient of the per-sample loss (w - sample)^2 at the
            # current weight, using a single data point (stochasticity).
            # The samples average to 3, so the expected loss is minimized at w = 3,
            # matching loss_function up to an additive constant.
            grad = 2 * (w - sample)
            # Update weight in the direction of negative gradient
            w -= learning_rate * grad
            weight_history.append(w)
    
    return weight_history

# Test our SGD implementation
data_samples = np.linspace(0, 6, 100)  # Generate 100 samples between 0 and 6
epochs = 10
learning_rate = 0.1
weights = stochastic_gradient_descent(epochs, learning_rate, data_samples)

# Print final weight (varies from run to run, but should fluctuate around 3)
print("Final weight after SGD:", weights[-1])


# Backpropagation: Forward and Backward Phases

Backpropagation is the backbone of training deep neural networks. It consists of two main phases:

## 1. Forward Phase

In this phase, we move from the input layer to the output layer, predicting the output:

1. **Input Data**: Begin by feeding a training example into the network.
2. **Calculate Activations**: For each layer:
    * Compute the weighted sum of the inputs from the previous layer.
    * Apply an activation function (like sigmoid or ReLU) to these weighted sums.
    * This produces the activation values for the current layer.
3. **Predict Output**: After passing through all layers, the network produces a prediction based on the final layer's activations.

The forward phase's main goal is to compute the network's output and see how it compares to the actual target or label.

## 2. Backward Phase

This phase is where the learning happens. We move from the output layer back to the input layer, adjusting the weights:

1. **Compute Error**: Calculate the difference between the predicted output and the actual label. This tells us how well (or poorly) our network performed.
2. **Propagate Error Backward**: For each layer, starting from the output:
    * Compute the gradient of the loss with respect to the activations.
    * Use this gradient to calculate how much each neuron in the previous layer contributed to the error (this is the "backpropagation" step).
3. **Update Weights**: Adjust the weights in each layer using the gradients computed in the previous step. The adjustments are made in the direction that reduces the error.

The backward phase's main goal is to find out how much each weight in the network contributed to the error and then adjust it to reduce the error.
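
One way to check that a backward pass is correct is to compare its gradients with finite-difference estimates. The sketch below does this for a tiny 2-3-1 sigmoid network with a squared-error loss. Biases are left out for brevity, and all names (`backprop_grads`, `numerical_grad`) and values are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(W1, W2, x, y):
    # Forward phase followed by the squared-error loss
    hidden = sigmoid(W1 @ x)
    out = sigmoid(W2 @ hidden)
    return 0.5 * np.sum((out - y) ** 2)

def backprop_grads(W1, W2, x, y):
    # Forward phase: keep the activations for reuse in the backward phase
    hidden = sigmoid(W1 @ x)
    out = sigmoid(W2 @ hidden)
    # Backward phase: error at the output, then propagated to the hidden layer
    delta_out = (out - y) * out * (1 - out)
    delta_hidden = (W2.T @ delta_out) * hidden * (1 - hidden)
    # Gradient for each weight = error of receiving neuron * output of sending neuron
    return np.outer(delta_hidden, x), np.outer(delta_out, hidden)

def numerical_grad(f, W, eps=1e-6):
    # Central finite differences, perturbing one weight at a time
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        old = W[idx]
        W[idx] = old + eps
        up = f()
        W[idx] = old - eps
        down = f()
        W[idx] = old
        grad[idx] = (up - down) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 2))   # input -> hidden weights
W2 = rng.normal(size=(1, 3))   # hidden -> output weights
x = np.array([0.5, -0.2])
y = np.array([1.0])

g1, g2 = backprop_grads(W1, W2, x, y)
n1 = numerical_grad(lambda: loss(W1, W2, x, y), W1)
n2 = numerical_grad(lambda: loss(W1, W2, x, y), W2)
print("max difference, layer 1:", np.max(np.abs(g1 - n1)))
print("max difference, layer 2:", np.max(np.abs(g2 - n2)))
```

If the two differences are tiny, the backward phase is assigning the right share of the error to each weight.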

## Summary:

Think of backpropagation as a teacher correcting a student's homework. 

* In the **forward phase**, the student (neural network) tries to solve the problem (makes a prediction). 
* In the **backward phase**, the teacher (backpropagation algorithm) checks the work, sees where the student made mistakes, and provides guidance on how to correct them (adjusts the weights).


# Pseudocode for Backpropagation Algorithm

## 1. Initialization:
- Initialize all weights and biases in the network randomly.
- Define a learning rate $ \alpha $.

## 2. For each training sample:
### Forward Phase:

1. **Input Data**: 
    - Set the activations for the input layer: $ a^{(0)} = \text{input} $

2. **Propagation Forward through the Layers**:
    - For layer $ l = 1 $ to $ L $ (where $ L $ is the output layer):
        - Compute weighted sum: $ z^{(l)} = w^{(l)} \cdot a^{(l-1)} + b^{(l)} $
        - Compute activation: $ a^{(l)} = \text{activation\_function}(z^{(l)}) $

3. **Output**: 
    - The prediction will be the activations from the last layer: $ \text{prediction} = a^{(L)} $

### Backward Phase:

4. **Compute Output Layer Error**:
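    - Calculate the difference between the network's output and the actual target: $ \delta^{(L)} = \text{prediction} - \text{target} $ (this simple form holds when the output activation is paired with its matching loss, e.g. a sigmoid output with cross-entropy loss or a linear output with squared error; otherwise, multiply element-wise by the activation derivative at $ z^{(L)} $)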

5. **Backpropagate the Error**:
    - For layer $ l = L-1 $ down to $ 1 $:
        - Compute the gradient of the activation function: $ g' = \text{activation\_function\_derivative}(z^{(l)}) $
        - Compute the error for layer $ l $: $ \delta^{(l)} = \left( (w^{(l+1)})^T \cdot \delta^{(l+1)} \right) \odot g' $, where $ \odot $ denotes element-wise multiplication

6. **Update Weights and Biases**:
    - For layer $ l = 1 $ to $ L $:
        - Update weights: $ w^{(l)} = w^{(l)} - \alpha \cdot \delta^{(l)} \cdot (a^{(l-1)})^T $
        - Update biases: $ b^{(l)} = b^{(l)} - \alpha \cdot \delta^{(l)} $

## 3. Iterate:
- Repeat the process for a specified number of iterations or until the network converges (i.e., the error becomes sufficiently small).

## 4. End:
- After training, use the optimized weights and biases for predictions on new, unseen data.
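
The pseudocode above maps fairly directly onto NumPy. The following is a minimal sketch, not a reference implementation: it assumes sigmoid activations in every layer, a cross-entropy loss (so that $ \delta^{(L)} = \text{prediction} - \text{target} $ holds), and uses the XOR problem as a hypothetical toy dataset. The names `train_step`, `predict`, and `mean_loss`, the layer sizes, and the hyperparameters are all illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(weights, biases, x, target, alpha):
    # Forward phase: activations[0] is the input a^(0)
    activations = [x]
    for W, b in zip(weights, biases):
        activations.append(sigmoid(W @ activations[-1] + b))

    # Backward phase: output error first, then propagate toward the input
    deltas = [None] * len(weights)
    deltas[-1] = activations[-1] - target
    for l in range(len(weights) - 2, -1, -1):
        a = activations[l + 1]                     # activation of the layer fed by weights[l]
        deltas[l] = (weights[l + 1].T @ deltas[l + 1]) * a * (1 - a)

    # Update weights and biases in place
    for l in range(len(weights)):
        weights[l] -= alpha * np.outer(deltas[l], activations[l])
        biases[l] -= alpha * deltas[l]

def predict(weights, biases, x):
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a

def mean_loss(weights, biases, X, Y, eps=1e-12):
    # Average binary cross-entropy over the dataset
    total = 0.0
    for x, y in zip(X, Y):
        p = predict(weights, biases, x)
        total += -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return total / len(X)

# 1. Initialization: random weights and biases for a 2-4-1 network
rng = np.random.default_rng(0)
sizes = [2, 4, 1]
weights = [rng.normal(size=(n_out, n_in)) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=n_out) for n_out in sizes[1:]]

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

# 2-3. Loop over the training samples for a fixed number of epochs
initial_loss = mean_loss(weights, biases, X, Y)
for epoch in range(3000):
    for i in rng.permutation(len(X)):
        train_step(weights, biases, X[i], Y[i], alpha=0.5)
final_loss = mean_loss(weights, biases, X, Y)

# 4. Use the trained network for predictions
print("loss before training:", initial_loss)
print("loss after training: ", final_loss)
print("predictions:", [float(predict(weights, biases, x)[0]) for x in X])
```

All the error terms are computed before any weight is changed, which matches the ordering of steps 5 and 6 in the pseudocode. How close the predictions get to the XOR targets depends on the random initialization and the chosen hyperparameters.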
