## 1. Explain the concept of forward propagation in a neural network.

![image.png](attachment:7a5e4624-7906-45f9-b5a0-d443f5059594.png)

### What is forward propagation?
---
Forward propagation is the calculation and storage of intermediate values as input data is fed forward through the network to generate an output. The hidden layers accept the data from the input layer, process it using their weights and activation function, and pass it on to the successive layers and finally to the output layer. Data flows in one direction only, so there are no cycles that would keep an output from being produced. A network configured this way is known as a feed-forward network.

You can think of it as the period when the network, with its initial thetas (weights), gets to know the data for the first time; later we'll use backpropagation to improve the thetas in each layer.

---

![image.png](attachment:a7dae206-9dbe-4923-b63d-ec3f8c25130a.png)

### Important Terms:
1. The first layer is called the input layer. It holds all the features you want to feed into the network.
2. The layers from the second layer up to the one before the last are called the hidden layers.
3. Finally, the last layer is called the outcome/output layer.
    - The output layer has a single neuron for binary classification, but it can have more than one for multiclass classification. In that case you use the softmax function, which is covered later.
---
### Steps of Forward Propagation:

1. Carefully select the features you want to feed into the network:
   - to avoid overfitting and unnecessary model complexity;
   - to reduce the time needed to train the model.
2. Add the bias term.
3. Each feature has as many thetas (weights) as there are neurons in the first hidden layer, and each of those neurons also gets a bias.
4. Preactivation: a weighted sum of the inputs, i.e. a linear transformation of the inputs by the weights. Based on this aggregated sum and the activation function, the neuron decides how strongly to pass the information on.
5. Activation: the weighted sum of inputs is passed to the activation function, a mathematical function that adds non-linearity to the network. Four commonly used activation functions are sigmoid, hyperbolic tangent (tanh), ReLU, and softmax.
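
Steps 2 to 4 can be sketched in NumPy as follows. This is only an illustration: the feature values, the number of neurons, and the choice to fold the bias into the theta matrix as an extra row (by prepending a column of ones to the inputs) are assumptions made for the example, not something fixed by the steps above.

```python
import numpy as np

# Made-up data: 3 samples with 2 selected features each.
X = np.array([[0.5, 0.8],
              [0.1, 0.3],
              [0.9, 0.4]])
n_neurons = 4  # assumed size of the first hidden layer

# Step 2: add the bias term as a leading column of ones.
ones = np.ones((X.shape[0], 1))
X_bias = np.hstack([ones, X])            # shape (3, 3)

# Step 3: one theta per (feature, neuron) pair, plus one bias row.
rng = np.random.default_rng(0)
theta = rng.normal(scale=0.01, size=(X_bias.shape[1], n_neurons))  # shape (3, 4)

# Step 4: preactivation is the weighted sum for every sample and neuron.
Z = X_bias @ theta                       # shape (3, 4)
print(X_bias.shape, theta.shape, Z.shape)
```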
---
### **Key Steps in Forward Propagation**
1. **Input Layer**:
   - The process starts with the input features (e.g., numerical data, image pixels, etc.) being fed into the input layer of the neural network.

2. **Weight and Bias Application**:
   - Each connection between nodes in different layers is associated with a weight, which determines the strength of the connection. Each neuron also has a bias term added to adjust the output.

   - For a given neuron, the weighted sum of inputs is calculated:
     \[
     z = \sum_{i=1}^{n} w_i x_i + b
     \]
     where \(w_i\) are the weights, \(x_i\) are the input values, and \(b\) is the bias.

3. **Activation Function**:
   - The result (\(z\)) of the weighted sum is passed through an activation function. This introduces non-linearity, enabling the network to learn complex patterns.

   - Common activation functions include:
     - **Sigmoid**: \(\sigma(z) = \frac{1}{1 + e^{-z}}\)
     - **ReLU**: \(f(z) = \max(0, z)\)
     - **Tanh**: \(f(z) = \tanh(z)\)

4. **Propagation Through Layers**:
   - The outputs of one layer (after applying the activation function) serve as inputs to the next layer. This process is repeated until the data reaches the output layer.

5. **Output Layer**:
   - In the final layer, the output is computed based on the network's architecture and the problem type:
     - **Regression Problems**: Output is usually a single continuous value (e.g., no activation or linear activation).
     - **Classification Problems**: Output is often probabilities (using softmax or sigmoid activation).

---

### **Purpose of Forward Propagation**
- Forward propagation is used to make predictions or compute the output of the neural network given input data.
- During training, the output of forward propagation is compared with the target labels to calculate the **loss**. This loss guides the network's learning through backpropagation.

---

### **Example of Forward Propagation**
Consider a simple neural network with:
- Two inputs (\(x_1, x_2\)),
- One hidden layer with two neurons,
- One output neuron.
---
#### Forward Propagation Steps:
1. **Input Layer**:
   - Input values: \(x_1 = 0.5, x_2 = 0.8\).

2. **Hidden Layer**:
   - Weighted sums and activations are computed for each hidden neuron:
     - \(z_1 = w_{11}x_1 + w_{12}x_2 + b_1\)
     - \(a_1 = \text{Activation}(z_1)\)
     - Similarly for \(z_2\) and \(a_2\).

3. **Output Layer**:
   - Using the activations from the hidden layer (\(a_1, a_2\)):
     - \(z_{\text{output}} = w_{o1}a_1 + w_{o2}a_2 + b_{\text{output}}\)
     - Output: \(y = \text{Activation}(z_{\text{output}})\).

This output (\(y\)) represents the neural network's prediction.
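
The same example can be traced numerically. The text gives only the inputs, so every weight and bias below is an assumed value, and sigmoid is an assumed choice for the unspecified `Activation`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, 0.8])             # x_1, x_2 from the example

# Assumed parameters (not given in the text).
W_hidden = np.array([[0.1, 0.2],     # w_11, w_12
                     [0.3, 0.4]])    # w_21, w_22
b_hidden = np.array([0.0, 0.0])      # b_1, b_2
w_out = np.array([0.5, -0.5])        # w_o1, w_o2
b_out = 0.0

# Hidden layer: z_1 = 0.05 + 0.16 = 0.21, z_2 = 0.15 + 0.32 = 0.47
z_hidden = W_hidden @ x + b_hidden
a_hidden = sigmoid(z_hidden)

# Output layer
z_output = w_out @ a_hidden + b_out
y = sigmoid(z_output)
print(z_hidden, a_hidden, y)
```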

--- 

## 2. What is the purpose of the activation function in forward propagation?

The activation function plays a crucial role in forward propagation in a neural network. Its main purpose is to introduce **non-linearity** into the model, which allows the network to learn and represent complex patterns in the data. Without activation functions, the neural network would essentially behave like a linear model, regardless of its depth.

---

### **Purposes of the Activation Function**

1. **Introduce Non-Linearity**:
   - In real-world problems, the relationships between inputs and outputs are often non-linear. Activation functions enable the network to model these relationships effectively.
   - Without activation functions, a neural network composed of multiple layers would collapse into a single-layer linear model, as the composition of linear functions is still linear.

2. **Enable Learning of Complex Features**:
   - Non-linear activation functions allow different neurons to capture complex features or patterns from the input data. 
   - For example:
     - In image recognition, earlier layers might detect edges, while deeper layers recognize objects like faces or cars.

3. **Control Output Range**:
   - Activation functions often constrain the output of neurons to a specific range, making the network more stable during training and better at handling varying input magnitudes.
     - For example:
       - **Sigmoid** outputs values between \(0\) and \(1\), useful for probabilities.
       - **Tanh** outputs values between \(-1\) and \(1\), providing zero-centered outputs.

4. **Determine Layer Behavior**:
   - Different activation functions affect how each layer processes inputs:
     - **ReLU** (Rectified Linear Unit) activates only positive inputs, introducing sparsity.
     - **Softmax** converts raw scores into probabilities, making it suitable for classification tasks.

5. **Prevent Signal Decay or Explosion**:
   - A suitable choice of activation function helps keep the gradient from vanishing or exploding during backpropagation, particularly in deep networks. For example:
     - ReLU mitigates the vanishing gradient problem that sigmoid and tanh face when they saturate, because its gradient is 1 for every positive input.

---

### **Examples of Common Activation Functions**

1. **Sigmoid**:
   \[
   \sigma(z) = \frac{1}{1 + e^{-z}}
   \]
   - Purpose: Outputs values between 0 and 1, useful for binary classification.
   - Limitation: Can cause vanishing gradients for inputs of large magnitude, where the curve saturates.

2. **ReLU (Rectified Linear Unit)**:
   \[
   f(z) = \max(0, z)
   \]
   - Purpose: Simple and efficient, allows networks to learn faster.
   - Limitation: Can cause "dead neurons" whose pre-activation stays negative, so both their output and their gradient are zero.

3. **Tanh (Hyperbolic Tangent)**:
   \[
   \text{tanh}(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}
   \]
   - Purpose: Outputs values between \(-1\) and \(1\), providing zero-centered outputs.
   - Limitation: Also susceptible to vanishing gradients.

4. **Softmax**:
   \[
   \text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j} e^{z_j}}
   \]
   - Purpose: Converts raw scores into probabilities for multi-class classification problems.
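
The four formulas above translate almost directly into NumPy. This is a sketch for 1-D inputs; the max-subtraction inside `softmax` is an added numerical-stability step that does not change the result.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def tanh(z):
    return np.tanh(z)

def softmax(z):
    # Subtracting the maximum avoids overflow in exp; the ratios are unchanged.
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

z = np.array([-2.0, 0.0, 3.0])   # arbitrary example scores
print("sigmoid:", sigmoid(z))    # each value in (0, 1)
print("relu:   ", relu(z))       # negatives clipped to 0
print("tanh:   ", tanh(z))       # each value in (-1, 1)
print("softmax:", softmax(z))    # non-negative, sums to 1
```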

---

### **Why Non-Linearity is Critical**
Without non-linear activation functions, no matter how many layers a network has, the entire model would behave as a linear combination of inputs:
\[
f(x) = W_2(W_1x + b_1) + b_2 = Wx + b, \quad \text{where } W = W_2W_1 \text{ and } b = W_2b_1 + b_2
\]
This would limit the network's ability to solve problems where data relationships are inherently non-linear.
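
The collapse can be checked numerically. The small matrices below are made up for the check; the last lines show that inserting a ReLU between the layers breaks the equivalence for this input.

```python
import numpy as np

# Hypothetical two-layer network with no activation function.
x = np.array([1.0, -1.0])
W1 = np.array([[1.0, 0.0],
               [0.0, 1.0]])
b1 = np.array([0.0, 0.0])
W2 = np.array([[1.0, 1.0]])
b2 = np.array([0.5])

two_layer = W2 @ (W1 @ x + b1) + b2

# Equivalent single layer: W = W2 W1, b = W2 b1 + b2
W = W2 @ W1
b = W2 @ b1 + b2
one_layer = W @ x + b
print(np.allclose(two_layer, one_layer))  # True

# With a ReLU between the layers the result is no longer the same.
with_relu = W2 @ np.maximum(0.0, W1 @ x + b1) + b2
print(two_layer, with_relu)
```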

In summary, activation functions make neural networks powerful and versatile tools for solving a wide range of complex tasks.

## 3. Describe the steps involved in the backward propagation (backpropagation) algorithm.

### Forward and Backpropagation in Neural Networks

Forward propagation in neural networks refers to the process of passing input data through the network’s layers to compute and produce an output. Each layer processes the data and passes it to the next layer until the final output is obtained. The forward pass itself does not change the weights: during training, its output is compared with the actual output, and backpropagation then adjusts the weights to minimize the difference between the two.

![image.png](attachment:13baab95-5637-47ad-b82d-a5b094698b0b.png)

The backpropagation procedure entails calculating the error between the predicted output and the actual target output while passing on information in reverse through the feedforward network, starting from the last layer and moving towards the first. To compute the gradient at a specific layer, the gradients of all subsequent layers are combined using the chain rule of calculus.

Backpropagation, also known as backward propagation of errors, is a widely employed technique for computing derivatives within deep feedforward neural networks. It plays a crucial role in various supervised learning algorithms used for training this type of neural network.

Training a neural network involves gradient descent, an iterative optimization algorithm for finding a local minimum of a differentiable function. During the training process, a loss function is computed to measure the disparity between the network’s predictions and the actual values. Backpropagation enables the calculation of the gradient of the loss function with respect to every weight in the network. This makes it possible to update each weight individually, gradually reducing the loss over multiple training iterations.

---

### What does the backpropagation process look like?

Backpropagation serves the purpose of minimizing the cost function by fine-tuning the neural network’s weights and biases. The extent of these adjustments depends on the gradients of the cost function with respect to these parameters. By computing the gradients through the chain rule, backpropagation efficiently propagates error information backward through the network. Consequently, the network can iteratively update its parameters in the direction opposite to the gradient. This iterative process enables the neural network to converge towards improved performance and accurate predictions.

Computing the gradients of the weights in backpropagation involves two fundamental steps: the forward pass and the backward pass.

---

### Forward pass
During the forward pass, the input data is propagated through the network layer by layer, starting from the input layer and moving towards the output layer. Each neuron in the network receives inputs, calculates a weighted sum of the inputs, applies an activation function, and passes the output to the next layer. This process continues until the final output is obtained. The forward pass calculates the output of the network based on the current weights.

Before we proceed to the backward pass, we need an efficient way to calculate the gradients that drive the weights towards better values, which is not a trivial task for complicated networks with many parameters. This is where the computational graph comes into play.

---

### What is a computational graph?
A computational graph is a directed graph used to represent the computations performed inside a model. The graph typically starts with inputs like data (X) and labels (Y). As we move from left to right in the graph, we encounter nodes representing fundamental computations involved in computing the function. For instance, there are nodes for matrix multiplication between input (X) and weight matrix (W), a red node for hinge loss (used in SVM classifiers), and a green node for the regularization term in the model. The graph concludes with an output node representing the scalar loss (L) to be computed during model training. While this computational graph might seem simple for linear models due to the limited number of operations, it becomes more complex and crucial for intricate models with multiple computations.

---

![image.png](attachment:c96a7484-4f69-4d7d-bd08-6b260f9ec293.png)

---

When we go backwards with the goal of minimizing the loss function, the computational graph lets each intermediate derivative be computed once and reused, which significantly reduces the required computations.

The process is explained in detail in this whitepaper.

At each node, reverse-mode differentiation merges all paths that originated at that node. We do not need to evaluate all possible combinations of the weights’ mutual influence, but thanks to derivatives, we can have the proper coefficients by computing the backward operation for each of the nodes only once.
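
To make the idea concrete, here is a hand-written forward and backward traversal of a tiny graph. The graph (a single weight, a bias, and a squared-error loss) and all numeric values are assumptions picked for illustration; it is much smaller than the one in the figure.

```python
# Graph: p = w * x  ->  s = p + b  ->  e = s - y  ->  L = e ** 2
x, y = 2.0, 1.0    # input and label (made-up)
w, b = 0.5, 0.1    # parameters (made-up)

# Forward pass: evaluate each node once and keep its value.
p = w * x          # 1.0
s = p + b          # 1.1
e = s - y          # 0.1
L = e ** 2         # 0.01

# Backward pass: visit the nodes in reverse order, one local derivative each.
dL_de = 2.0 * e          # d(e^2)/de
dL_ds = dL_de * 1.0      # d(s - y)/ds = 1
dL_db = dL_ds * 1.0      # d(p + b)/db = 1
dL_dp = dL_ds * 1.0      # d(p + b)/dp = 1
dL_dw = dL_dp * x        # d(w * x)/dw = x

print(L, dL_dw, dL_db)
```

Note that `dL_ds` is computed once and then reused for both `dL_db` and `dL_dp`, which is the saving the paragraph above describes.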

---

### Backward pass
In the backward pass, the gradients of the weights are computed by propagating the error backwards through the network. It starts from the output layer and moves towards the input layer. The error is quantified by comparing the predicted output of the network with the true output or target value. The gradient of the loss function with respect to each weight is calculated using the chain rule of calculus, which involves computing the partial derivatives of the loss with respect to the weights at each layer. The gradients are then used to update the weights of the network, aiming to minimize the loss function.

The backward pass essentially determines how much each weight contributed to the overall error and adjusts them accordingly. By iteratively performing forward and backward passes, the network learns to adjust its weights, improving its ability to make accurate predictions.
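
A minimal sketch of one forward pass, one backward pass, and one weight update is shown below. The architecture (one sigmoid hidden layer, a linear output, half mean squared error), the random data, and the learning rate are all assumptions chosen to keep the example short; the finite-difference comparison is there only to sanity-check one of the analytic gradients.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    """Forward pass: returns hidden activations and the prediction."""
    A1 = sigmoid(X @ W1 + b1)   # hidden layer
    Y_hat = A1 @ W2 + b2        # linear output layer
    return A1, Y_hat

def loss_fn(Y_hat, Y):
    """Half mean squared error over the batch."""
    return 0.5 * np.mean(np.sum((Y_hat - Y) ** 2, axis=1))

def backward(X, Y, W2, A1, Y_hat):
    """Backward pass: chain rule from the output layer to the input layer."""
    m = X.shape[0]
    dZ2 = (Y_hat - Y) / m                  # dL/dZ2 at the output
    dW2 = A1.T @ dZ2
    db2 = dZ2.sum(axis=0, keepdims=True)
    dA1 = dZ2 @ W2.T                       # error sent back to the hidden layer
    dZ1 = dA1 * A1 * (1.0 - A1)            # times the sigmoid derivative
    dW1 = X.T @ dZ1
    db1 = dZ1.sum(axis=0, keepdims=True)
    return dW1, db1, dW2, db2

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                # made-up inputs
Y = rng.normal(size=(5, 2))                # made-up targets
W1, b1 = rng.normal(size=(3, 4)) * 0.5, np.zeros((1, 4))
W2, b2 = rng.normal(size=(4, 2)) * 0.5, np.zeros((1, 2))

A1, Y_hat = forward(X, W1, b1, W2, b2)
loss_before = loss_fn(Y_hat, Y)
dW1, db1_grad, dW2, db2_grad = backward(X, Y, W2, A1, Y_hat)

# Finite-difference check of a single entry of dW1.
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (loss_fn(forward(X, W1_plus, b1, W2, b2)[1], Y)
           - loss_fn(forward(X, W1_minus, b1, W2, b2)[1], Y)) / (2 * eps)

# One gradient-descent step in the direction opposite to the gradient.
lr = 0.01
W1, b1 = W1 - lr * dW1, b1 - lr * db1_grad
W2, b2 = W2 - lr * dW2, b2 - lr * db2_grad
loss_after = loss_fn(forward(X, W1, b1, W2, b2)[1], Y)

print(dW1[0, 0], numeric)
print(loss_before, loss_after)
```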

---

### What are the types of the backpropagation algorithm?

The two main types of backpropagation networks are static backpropagation, which provides instant mapping, and recurrent backpropagation, which involves fixed-point learning.

---
*1. Static backpropagation:* It is commonly used in feedforward neural networks and some convolutional neural networks (CNNs) where there is no temporal dependence between data points. The algorithm accumulates the gradients of the loss function over a batch of data points and then performs a single update to the model’s parameters. The batching process helps to take advantage of parallel processing capabilities in modern hardware, making the training process more efficient for large datasets.

This algorithm is capable of solving static classification problems, such as optical character recognition (OCR).

---

*2. Recurrent backpropagation:* Recurrent backpropagation is an extension of the backpropagation algorithm used in recurrent neural networks (RNNs). In RNNs, the data flows in cycles through a series of interconnected nodes, allowing the network to retain information from previous time steps.

Recurrent backpropagation involves propagating the error signal backward through time in the RNN. It calculates the gradients of the loss function with respect to the model’s parameters over multiple time steps, taking into account the dependencies and interactions between the current time step and the previous ones. This process enables the network to learn and update its parameters to improve its performance on tasks that require sequential or temporal dependencies, such as natural language processing, speech recognition, and time series prediction.

---

### Why use backpropagation?
After the completion of the forward pass, the network’s error is evaluated and ideally should be minimized.

If the current error is high, it indicates that the network has not effectively learned from the data. In other words, the current set of weights is not sufficiently accurate to minimize the error and generate precise predictions. Consequently, it becomes necessary to update the neural network weights to reduce the error.

The backpropagation algorithm plays a crucial role in these weight updates: it supplies the gradients that tell us how to change each weight to reduce the error.

---

### Advantages of the backpropagation algorithm
Backpropagation offers several key benefits:

1. Memory efficiency
It efficiently calculates derivatives, utilizing less memory compared to alternative optimization algorithms like the genetic algorithm. This is particularly beneficial when working with large neural networks.

2. Speed
It is fast, particularly for small and medium-sized NNs. However, as the number of layers and neurons increases, the computation of more derivatives can result in slower performance.

3. Versatility
This algorithm is applicable to various network architectures, including convolutional neural networks, generative adversarial networks, fully-connected networks, and more. Backpropagation’s generic nature allows it to work effectively in diverse scenarios.

4. Parameter simplicity
Backpropagation does not require tuning specific parameters, thereby reducing overhead. The only parameters involved in the process are associated with the gradient descent algorithm, such as the learning rate.

While working with neural networks, we can use different optimization algorithms to reduce the loss, many of which also adapt the learning rate, to provide more precise results. There are many alternatives to plain gradient descent for updating the parameters of your neural network, such as Adam (Adaptive Moment Estimation), which has been a popular default for years, Nesterov Accelerated Gradient, AdaGrad, and AdaDelta.

If you wish to learn more on this point, check out this detailed description of different optimizers.

One of the more recent algorithms for loss function optimization is the Sophia optimizer, released in May 2023 by Stanford researchers. What all of these optimizers minimize is the cost function, which we explain below.

---

### Computing backpropagation: Cost function
A common cost function is the squared error: the square of the difference between the model’s output and the desired output.

When applying a neural network to millions of images with associated pixel values, we obtain a predicted output for each image and can compare it with the corresponding actual value.

A smaller cost function indicates better model performance on the training data. A model with a minimized cost function is hoped to perform well on unseen data too, although that has to be checked on held-out data, since a very low training cost can also signal overfitting.

The cost function takes all the input, which can involve millions of parameters, and produces a single value. This value serves as a guide, indicating how much improvement is required in the model. It informs the model that it is performing poorly and adjustments are needed in its weights and biases. However, simply informing the model about its performance is not sufficient. We also need to provide a method to the model that allows it to minimize the error.

This is where gradient descent and backpropagation come into play, providing the means for the model to update its parameters and reduce the cost function.
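
As a small illustration of how a cost function reduces many predictions to a single number, the sketch below averages the squared differences (mean squared error). Averaging over examples is an assumption here, and the predictions and targets are invented.

```python
import numpy as np

def mse_cost(y_pred, y_true):
    """Mean of the squared differences between predictions and targets."""
    return np.mean((y_pred - y_true) ** 2)

y_true = np.array([1.0, 0.0, 1.0, 1.0])   # desired outputs (made-up)
y_pred = np.array([0.9, 0.2, 0.6, 1.0])   # model outputs (made-up)

# (0.01 + 0.04 + 0.16 + 0.0) / 4 = 0.0525
cost = mse_cost(y_pred, y_true)
print(cost)
```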

---

### Gradient descent:

![image.png](attachment:2693f86e-68db-4dfd-8c41-0d677f9979d9.png)

To achieve better parameter tuning and minimize discrepancies between actual and predicted output, we employ an intuitive algorithm called gradient descent. Gradient descent and its variants are currently the most popular optimization strategy in machine learning and deep learning. The algorithm measures the error and iteratively reduces it. Mathematically, it searches for a minimum of the cost function; for a convex function this is the global minimum, while for the non-convex losses of neural networks it is generally a local one.

The concept of gradient can be understood as the measurement of how a function’s output changes when its inputs are slightly modified. It can also be visualized as the slope of a function, where a higher gradient indicates a steeper slope and facilitates faster learning for the model. Metaphorically, you can liken it to descending to the bottom of a valley rather than ascending a hill. This is because it is an optimization algorithm that minimizes a given function.
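
The valley metaphor can be sketched in a few lines. The function `f(w) = (w - 3)^2`, the starting point, the learning rate, and the number of steps are all arbitrary choices for illustration.

```python
def f(w):
    """A simple convex function with its minimum at w = 3."""
    return (w - 3.0) ** 2

def grad_f(w):
    """Derivative of f: the slope at w."""
    return 2.0 * (w - 3.0)

w = 0.0               # arbitrary starting point
learning_rate = 0.1   # step size (assumed)

for step in range(50):
    # Step in the direction opposite to the gradient, i.e. downhill.
    w -= learning_rate * grad_f(w)

print(w, f(w))   # w ends up close to 3 and f(w) close to 0
```

Each step shrinks the distance to the minimum by a constant factor here, so the steps get smaller as the slope flattens near the bottom.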

---

### Types of gradient descent
Now let’s explore different types of gradient descent.

#### Batch gradient descent:

The batch size refers to the number of training examples included in a single batch. When it is not feasible to process the entire dataset at once, the dataset is divided into multiple batches or subsets; batch gradient descent is the case where the batch is the whole dataset.

In batch gradient descent, the complete dataset is utilized to compute the gradient of the cost function. However, this approach can be slow since it requires calculating the gradient over the entire dataset for each update. It can be challenging, especially with large datasets, because all the records have to be read into memory from disk. After the parameters are initialized, each iteration computes the cost and its gradient over all records and takes one step, and the process is repeated.

#### Mini-batch gradient descent:

Mini-batch gradient descent is a commonly used algorithm that balances speed and stability. The dataset is divided into small groups or batches of ‘n’ training examples. Unlike batch gradient descent, mini-batch gradient descent does not use the entire dataset for each update. In each iteration, a subset of ‘n’ training examples is employed to compute the gradient of the cost function. Compared with updating on a single example, this approach reduces the variance of parameter updates, leading to more stable convergence. Additionally, it can leverage optimized matrix operations, making gradient computations more efficient.

#### Stochastic gradient descent:
Stochastic gradient descent (SGD) updates the model’s parameters based on the gradient computed for a single randomly chosen training example (or, in looser usage, a small random subset of the data) at each iteration, thus allowing for faster computation per update. At each iteration, a random data point or batch is selected from the training dataset. The gradient of the loss function with respect to the model’s parameters is then calculated using that selection. Next, the model’s parameters are updated based on the computed gradient. The update is performed in the opposite direction of the gradient to move towards the minimum of the loss function. These steps are repeated for a fixed number of iterations or until convergence criteria are met.
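
The three variants differ only in how many examples feed each update, which the sketch below exposes as a `batch_size` argument. It fits a linear model on synthetic, noise-free data; the function name `gradient_descent`, the data, the learning rate, and the epoch count are all assumptions for illustration.

```python
import numpy as np

def gradient_descent(X, y, batch_size, learning_rate=0.05, epochs=200, seed=0):
    """Fit linear weights w by minimizing the mean squared error.

    batch_size == len(X) -> batch gradient descent
    1 < batch_size < len(X) -> mini-batch gradient descent
    batch_size == 1 -> stochastic gradient descent
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(epochs):
        order = rng.permutation(m)                 # reshuffle every epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            error = X[idx] @ w - y[idx]
            grad = 2.0 * X[idx].T @ error / len(idx)
            w -= learning_rate * grad              # step against the gradient
    return w

# Synthetic data generated from known weights, without noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(64, 2))
true_w = np.array([2.0, -1.0])
y = X @ true_w

w_batch = gradient_descent(X, y, batch_size=64)
w_mini = gradient_descent(X, y, batch_size=16)
w_sgd = gradient_descent(X, y, batch_size=1)
print(w_batch, w_mini, w_sgd)
```

Batch gradient descent makes one update per epoch, while the other two make several, which is where their speed advantage on large datasets comes from.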

---

## 4. What is the purpose of the chain rule in backpropagation?

The **chain rule** is fundamental to backpropagation because it allows us to efficiently compute the gradients of the loss function with respect to the parameters (weights and biases) of a neural network. Since neural networks consist of multiple layers, the chain rule provides a systematic way to calculate how changes in the parameters of earlier layers affect the final output and loss.

---

### **Purpose of the Chain Rule in Backpropagation**

1. **Efficiently Compute Gradients Across Layers**:
   - Neural networks are structured as a series of nested functions. For instance, the output of a three-layer network can be expressed as:
     \[
     \hat{y} = f_3(f_2(f_1(x)))
     \]
     Here:
     - \(f_1(x)\): First layer's output.
     - \(f_2(f_1(x))\): Second layer's output.
     - \(f_3\): Final layer's output.

   - The chain rule enables us to compute the gradient of the loss (\(\mathcal{L}\)) with respect to the parameters of each layer step-by-step:
     \[
     \frac{\partial \mathcal{L}}{\partial W_i} = \frac{\partial \mathcal{L}}{\partial f_n} \cdot \frac{\partial f_n}{\partial f_{n-1}} \cdot \dots \cdot \frac{\partial f_i}{\partial W_i}
     \]

2. **Propagate Error Signals from Output to Input**:
   - The chain rule breaks down the contribution of each layer to the total error. Starting at the output layer, the error is propagated backward through the network, layer by layer, using the chain rule to compute how much each weight and bias contributed to the error.

3. **Handle Complex, Non-Linear Structures**:
   - Neural networks are highly non-linear due to activation functions like ReLU, sigmoid, or tanh. The chain rule accounts for these non-linearities by computing derivatives for each function in the network and chaining them together.

4. **Parameter Optimization**:
   - Backpropagation uses the chain rule to compute the gradients required for optimization algorithms (e.g., Gradient Descent). These gradients dictate how to adjust the weights and biases to minimize the loss function.

---

### **Example: Chain Rule in Backpropagation**
Consider a simple neural network with one hidden layer:
\[
\hat{y} = f_2(f_1(x))
\]
Let:
- \(f_1(x) = W_1 \cdot x + b_1\) (first layer output),
- \(f_2(f_1) = W_2 \cdot f_1 + b_2\) (final output).

The loss function is \(\mathcal{L}(\hat{y}, y_{\text{true}})\).

#### Gradient Computation:
1. **Gradient at the output layer** (\(W_2\)):
   \[
   \frac{\partial \mathcal{L}}{\partial W_2} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial W_2}
   \]

2. **Gradient at the hidden layer** (\(W_1\)):
   \[
   \frac{\partial \mathcal{L}}{\partial W_1} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial f_1} \cdot \frac{\partial f_1}{\partial W_1}
   \]

Here, the chain rule allows gradients to "flow" backward from the output, layer by layer.
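
The two gradient formulas can be evaluated for a scalar version of this network. The squared-error loss and all numeric values are assumptions made for the example; the finite-difference estimate at the end is only a cross-check.

```python
# Scalar version of the example: f1 = W1*x + b1, y_hat = W2*f1 + b2
x, y_true = 2.0, 1.0          # made-up input and target
W1, b1 = 0.5, 0.1             # made-up first-layer parameters
W2, b2 = -1.5, 0.2            # made-up output-layer parameters

def loss(W1_value, W2_value):
    """Assumed loss: squared error (y_hat - y_true)^2."""
    f1 = W1_value * x + b1
    y_hat = W2_value * f1 + b2
    return (y_hat - y_true) ** 2

# Forward pass
f1 = W1 * x + b1              # 1.1
y_hat = W2 * f1 + b2          # -1.45

# Local derivatives
dL_dyhat = 2.0 * (y_hat - y_true)   # -4.9
dyhat_dW2 = f1
dyhat_df1 = W2
df1_dW1 = x

# Chain rule, exactly as in the two formulas above
dL_dW2 = dL_dyhat * dyhat_dW2               # -5.39
dL_dW1 = dL_dyhat * dyhat_df1 * df1_dW1     # 14.7

# Finite-difference cross-check for W1
eps = 1e-6
numeric_dW1 = (loss(W1 + eps, W2) - loss(W1 - eps, W2)) / (2 * eps)
print(dL_dW2, dL_dW1, numeric_dW1)
```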

---

### **Why the Chain Rule is Essential in Backpropagation**
Without the chain rule:
- Computing gradients in multi-layer networks would be intractable.
- Each layer's parameters would have to be analyzed independently for their effect on the output, leading to inefficient and redundant calculations.

By applying the chain rule, backpropagation leverages the modular structure of neural networks, ensuring efficient gradient computation and enabling the training of deep networks.

## 5. Implement the forward propagation process for a simple neural network with one hidden layer using NumPy.

Here’s how you can implement forward propagation for a simple neural network with one hidden layer using NumPy. This example assumes the network has:  

- An input layer with \(n_{\text{input}}\) neurons.  
- A hidden layer with \(n_{\text{hidden}}\) neurons using the ReLU activation function.  
- An output layer with \(n_{\text{output}}\) neurons using the softmax activation function (for multi-class classification).  

### **Code Implementation**

```python
import numpy as np

# Define the activation functions
def relu(z):
    return np.maximum(0, z)

def softmax(z):
    exp_z = np.exp(z - np.max(z, axis=1, keepdims=True))  # Numerical stability
    return exp_z / np.sum(exp_z, axis=1, keepdims=True)

# Define the forward propagation function
def forward_propagation(X, W1, b1, W2, b2):
    """
    Perform forward propagation for a simple neural network with one hidden layer.

    Parameters:
    X   : Input data matrix (shape: m x n_input)
    W1  : Weights for the hidden layer (shape: n_input x n_hidden)
    b1  : Biases for the hidden layer (shape: 1 x n_hidden)
    W2  : Weights for the output layer (shape: n_hidden x n_output)
    b2  : Biases for the output layer (shape: 1 x n_output)

    Returns:
    A2  : Output probabilities from the output layer (shape: m x n_output)
    Z1, A1, Z2: Intermediate computations for backpropagation (optional)
    """
    # Hidden layer computations
    Z1 = np.dot(X, W1) + b1  # Weighted sum for hidden layer
    A1 = relu(Z1)            # Activation for hidden layer

    # Output layer computations
    Z2 = np.dot(A1, W2) + b2  # Weighted sum for output layer
    A2 = softmax(Z2)          # Activation for output layer

    return A2, Z1, A1, Z2

# Example usage
if __name__ == "__main__":
    # Input data (e.g., 3 samples with 4 features each)
    X = np.array([[0.5, 0.2, 0.1, 0.4],
                  [0.9, 0.8, 0.3, 0.7],
                  [0.4, 0.6, 0.8, 0.1]])

    # Initialize weights and biases for the hidden and output layers
    np.random.seed(42)  # For reproducibility
    n_input = 4
    n_hidden = 5
    n_output = 3

    W1 = np.random.randn(n_input, n_hidden) * 0.01  # Small random values
    b1 = np.zeros((1, n_hidden))  # Biases initialized to zero
    W2 = np.random.randn(n_hidden, n_output) * 0.01
    b2 = np.zeros((1, n_output))

    # Perform forward propagation
    A2, Z1, A1, Z2 = forward_propagation(X, W1, b1, W2, b2)

    # Output the results
    print("Output probabilities (A2):")
    print(A2)
```

```
Output probabilities (A2):
[[0.33331655 0.3333473  0.33333614]
 [0.33329117 0.33337548 0.33333335]
 [0.33330031 0.33336065 0.33333905]]
```


### **Explanation of the Code**

1. **Input Data (\(X\))**:
   - A 2D matrix where each row is a sample and each column is a feature.

2. **Weights and Biases**:
   - `W1` and `b1`: Parameters for the hidden layer.
   - `W2` and `b2`: Parameters for the output layer.

3. **Hidden Layer Computation**:
   - \(Z_1 = X \cdot W_1 + b_1\): Linear combination of inputs.
   - \(A_1 = \text{ReLU}(Z_1)\): Apply ReLU activation.

4. **Output Layer Computation**:
   - \(Z_2 = A_1 \cdot W_2 + b_2\): Linear combination of hidden layer outputs.
   - \(A_2 = \text{Softmax}(Z_2)\): Convert to probabilities.

5. **Output**:
   - \(A_2\): Probabilities for each class for each input sample.

This implementation is modular, allowing easy extension for deeper networks or other activation functions.