### Question 1

The main purpose of forward propagation is to:
1. Initialize weights and biases
2. Calculate the predicted value based on weights and biases
3. Calculate the loss

### Question 2

Consider a network consisting of a single neuron with two input features x1 and x2, weights w1 and w2, and bias b. The activation function is the sigmoid function (σ(z)).

- Input vector: X = [x1, x2]
- Weight vector: W = [w1, w2]
- Bias: b

The weighted sum z would be:

z = w1 * x1 + w2 * x2 + b

The output y would be:

y = σ(z) = σ(w1 * x1 + w2 * x2 + b)

This represents the single neuron's activation based on the weighted combination of the input features and the bias.
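
The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the helper names `sigmoid` and `neuron_forward` and the numeric values are assumptions chosen for the example.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic sigmoid: maps any real z into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_forward(x, w, b):
    """Return (z, y) for one neuron: z = w1*x1 + w2*x2 + b, y = sigmoid(z)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return z, sigmoid(z)

# Illustrative values (assumed, not taken from the question).
x = [1.0, 2.0]      # X = [x1, x2]
w = [0.5, -0.25]    # W = [w1, w2]
b = 0.0
z, y = neuron_forward(x, w, b)
print(z, y)  # z = 0.5*1.0 + (-0.25)*2.0 + 0.0 = 0.0, so y = sigmoid(0) = 0.5
```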

### Question 3
Activation functions are mathematical functions applied to the weighted sum (also called linear combination or pre-activation) of inputs in each neuron during forward propagation.
These functions introduce non-linearity into the network's output, allowing it to model more intricate relationships between inputs and outputs.
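
To see why the non-linearity matters, the sketch below stacks two one-dimensional linear layers. Without an activation in between, the stack collapses into a single linear function; with one (ReLU is used here purely as an example), it does not. The weights and the helper names `two_layers` and `collapsed` are hypothetical.

```python
def relu(z: float) -> float:
    """Rectified linear unit: passes positive values, zeroes out negatives."""
    return max(0.0, z)

def two_layers(x: float, activation=None) -> float:
    """Layer 1: h = 2x + 1. Layer 2: out = -3h + 0.5 (weights are made up)."""
    h = 2.0 * x + 1.0
    if activation is not None:
        h = activation(h)
    return -3.0 * h + 0.5

def collapsed(x: float) -> float:
    """The same two layers merged by hand: (-3*2)x + (-3*1 + 0.5) = -6x - 2.5."""
    return -6.0 * x - 2.5

for x in (-2.0, 0.0, 1.5):
    # Without an activation both columns agree; with ReLU they can differ.
    print(x, two_layers(x), collapsed(x), two_layers(x, relu))
```

For x = -2.0 the hidden value is negative, so ReLU zeroes it and the network's output departs from the straight line `-6x - 2.5`.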

### Question 4

Weights (W): These represent the strength of the connections between an input feature and a neuron. In the weighted sum, a weight with a larger magnitude gives its feature a stronger influence on the neuron's activation. Weights are typically initialized with small random values and then adjusted during training, using the gradients that backpropagation computes.

Bias (b): This adds a constant value to the weighted sum of inputs. It shifts the neuron's pre-activation, and with it the point at which the activation function switches on, independently of the input values, giving the model more flexibility to fit the data.
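
As a rough illustration of the bias at work, the sketch below uses a one-input sigmoid neuron and computes the input at which its output crosses 0.5. Changing b moves that crossing point without touching the weight. The function names and values are assumptions made for the example.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x: float, w: float, b: float) -> float:
    """One-input neuron: sigmoid(w*x + b)."""
    return sigmoid(w * x + b)

def midpoint(w: float, b: float) -> float:
    """Input where the output is exactly 0.5, i.e. where w*x + b = 0."""
    return -b / w

# Same weight, different biases: the crossing point moves from 0.0 to 2.0.
print(midpoint(1.0, 0.0), midpoint(1.0, -2.0))
print(neuron(2.0, 1.0, 0.0), neuron(2.0, 1.0, -2.0))
```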

### Question 5

The purpose of the softmax function is to convert the network's raw output scores into probabilities, one per class, that sum to 1.

Imagine a network classifying images as cats, dogs, or birds. The softmax function would convert the network's outputs for each class into probabilities (e.g., 0.8 for cat, 0.1 for dog, and 0.1 for bird). This indicates a high confidence (80%) in predicting the image as a cat.
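
A minimal softmax sketch follows. The raw scores (logits) are assumed values, picked so that the result reproduces the cat/dog/bird probabilities from the example; subtracting the maximum before exponentiating is a common numerical-stability trick and does not change the result.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that are positive and sum to 1."""
    m = max(logits)                          # shift for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for (cat, dog, bird).
logits = [math.log(8.0), 0.0, 0.0]
probs = softmax(logits)
print(probs)  # approximately [0.8, 0.1, 0.1]
```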

### Question 6

Forward propagation calculates the output for a given input, but it doesn't tell us how well the network is performing. This is where backward propagation comes in.

- Error Calculation: Backward propagation starts from the error at the output (the difference between the predicted output and the actual target value) and propagates it backward through the network, layer by layer, to work out how much each weight and bias contributed to it.

- Weight Adjustment: The errors are then used to adjust the weights and biases in a way that minimizes the overall error. This essentially fine-tunes the network's connections to improve its future predictions.

### Question 7

1. Error at Output: We calculate the difference between the network's output (y) and the desired target value (t).

2. Error Gradient: We calculate the derivative of the error with respect to the weighted sum (z) using the derivative of the activation function applied at the output.

3. Weight Update: We use the error gradient and a learning rate (η) to update the weights (W) and bias (b) in a way that reduces the error.
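
The three steps can be sketched for a single sigmoid neuron as follows. The squared-error loss E = 0.5 * (y - t)^2, the function name `train_step`, and all numeric values are assumptions for illustration; other losses lead to a different expression in step 2.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def train_step(x, w, b, t, eta):
    """One gradient-descent update for a sigmoid neuron with squared error."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    y = sigmoid(z)
    # 1. Error at output.
    error = y - t
    # 2. Error gradient w.r.t. z: dE/dz = (y - t) * sigmoid'(z) = (y - t) * y * (1 - y).
    delta = error * y * (1.0 - y)
    # 3. Weight update: dE/dw_i = delta * x_i and dE/db = delta.
    new_w = [wi - eta * delta * xi for wi, xi in zip(w, x)]
    new_b = b - eta * delta
    return new_w, new_b, 0.5 * error ** 2

w, b = [0.5, -0.25], 0.0
losses = []
for _ in range(3):
    w, b, loss = train_step([1.0, 2.0], w, b, t=1.0, eta=0.5)
    losses.append(loss)
print(losses)  # the loss shrinks from one step to the next
```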

### Question 8

The chain rule is a fundamental concept in calculus that allows us to differentiate composite functions (functions within functions). In backpropagation for multi-layer neural networks:

- Each neuron's activation depends on the activations of the previous layer's neurons.
- We need to calculate the gradients (rates of change) of the error with respect to the weights in all layers.
- The chain rule provides a systematic way to "backpropagate" the error through the network, considering how each layer's activation contributes to the overall error.
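
As a sketch of the chain rule in action, consider a tiny network with one hidden unit and one output unit, both sigmoid, and a squared-error loss. These architecture and loss choices, and all names below, are assumptions for illustration. The analytic gradients are built by multiplying local derivatives from the output backward, then compared with a finite-difference estimate.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w1, w2):
    h = sigmoid(w1 * x)   # hidden activation
    y = sigmoid(w2 * h)   # output activation
    return h, y

def loss(x, t, w1, w2):
    _, y = forward(x, w1, w2)
    return 0.5 * (y - t) ** 2

def gradients(x, t, w1, w2):
    """Chain rule: dE/dw1 = dE/dy * dy/dz2 * dz2/dh * dh/dz1 * dz1/dw1."""
    h, y = forward(x, w1, w2)
    delta2 = (y - t) * y * (1.0 - y)       # dE/dz2 at the output unit
    dE_dw2 = delta2 * h                    # dz2/dw2 = h
    delta1 = delta2 * w2 * h * (1.0 - h)   # error pushed back to the hidden unit
    dE_dw1 = delta1 * x                    # dz1/dw1 = x
    return dE_dw1, dE_dw2

def numeric_grad(f, v, eps=1e-6):
    """Central finite difference, used only to check the analytic result."""
    return (f(v + eps) - f(v - eps)) / (2.0 * eps)

x, t, w1, w2 = 0.7, 1.0, 0.3, -0.8
g1, g2 = gradients(x, t, w1, w2)
n1 = numeric_grad(lambda v: loss(x, t, v, w2), w1)
n2 = numeric_grad(lambda v: loss(x, t, w1, v), w2)
print(g1, n1)
print(g2, n2)
```

The two numbers on each printed line should agree to several decimal places, which is a common sanity check for hand-written backpropagation code.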

### Question 9

Here are some common challenges that can arise during backward propagation in neural networks, along with strategies to address them:

**1. Vanishing or Exploding Gradients:**

- **Problem:** During backpropagation, the gradients used to update weights can become very small (vanishing) or very large (exploding) as they propagate through the network, especially in deep architectures. This can make learning slow or prevent it altogether.
- **Solutions:**
    - **Xavier/He initialization:** Initialize weights based on the number of input and output neurons in a layer to ensure gradients have a reasonable starting magnitude.
    - **Gradient clipping:** Limit the maximum value of gradients to prevent them from exploding.
    - **Residual connections (ResNets):** Introduce direct connections between layers (skip connections) to allow gradients to flow more easily through the network.
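
As one possible form of gradient clipping, the sketch below rescales a gradient vector whenever its Euclidean norm exceeds a threshold. Clipping by norm is only one variant (clipping each component to a fixed range is another), and the function name and threshold are assumptions.

```python
import math

def clip_by_norm(grads, max_norm):
    """Scale grads down so that their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)            # already small enough: leave unchanged
    scale = max_norm / norm
    return [g * scale for g in grads]

print(clip_by_norm([3.0, 4.0], 1.0))  # norm 5 is rescaled to norm 1
print(clip_by_norm([0.3, 0.4], 1.0))  # norm 0.5 is left as is
```

Rescaling keeps the direction of the update and only limits its length.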

**2. Local Minima:**

- **Problem:** The optimization algorithm might get stuck in a local minimum, where the error is lower than its immediate surroundings but not the global minimum. This can lead to suboptimal performance.
- **Solutions:**
    - **Momentum:** Use momentum to incorporate the direction of previous gradient updates, helping to escape local minima.
    - **Learning rate scheduling:** Adjust the learning rate during training to control the size of steps taken towards the minimum.
    - **Early stopping:** Stop training if the validation error plateaus for a certain number of epochs. Strictly speaking, this guards against overfitting to the training data rather than against local minima.
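
The momentum update can be sketched as below. The velocity `v` is a decaying sum of past gradient steps, so the direction of earlier updates carries over into the current one. The test function f(x) = x^2, the hyperparameters, and the function name are assumptions; the sketch shows only the update rule, not a case of escaping a local minimum.

```python
def sgd_momentum(grad_fn, x0, lr=0.1, beta=0.9, steps=200):
    """Gradient descent with classical momentum on a single scalar parameter."""
    x, v = x0, 0.0
    for _ in range(steps):
        v = beta * v - lr * grad_fn(x)   # blend previous direction with new gradient
        x = x + v
    return x

# Minimize f(x) = x**2, whose gradient is 2x and whose minimum is at x = 0.
result = sgd_momentum(lambda x: 2.0 * x, x0=5.0)
print(result)  # close to 0
```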

**3. Overfitting:**

- **Problem:** The model performs well on the training data but fails to generalize to unseen data. This happens when the network learns the training data's noise instead of the underlying patterns.
- **Solutions:**
    - **L1/L2 regularization:** Add penalty terms to the loss function that encourage smaller weights, reducing model complexity and preventing overfitting.
    - **Dropout:** Randomly drop out a certain percentage of neurons during training, forcing the network to learn more robust features.
    - **Data augmentation:** Artificially create new training data examples by applying random transformations (e.g., rotations, flips) to existing data, increasing the diversity of training examples.
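
Dropout can be sketched as follows. This uses the "inverted" variant, which rescales the surviving activations during training so that nothing needs to change at inference time; that choice, the function signature, and the seed are assumptions for the example.

```python
import random

def dropout(activations, p, training=True, rng=None):
    """Zero each activation with probability p (0 <= p < 1) during training."""
    if not training or p == 0.0:
        return list(activations)          # inference: pass values through unchanged
    rng = rng or random.Random()
    keep = 1.0 - p
    # Survivors are divided by keep so the expected value stays the same.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)                    # fixed seed so the run is repeatable
out = dropout([1.0] * 10, p=0.5, rng=rng)
print(out)                                # a mix of 0.0 and 2.0
print(dropout([1.0, 2.0], p=0.5, training=False))
```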

**4. Computational Cost:**

- **Problem:** Backpropagation can be computationally expensive, especially for large datasets or complex models.
- **Solutions:**
    - **Batching:** Train the network on mini-batches of data instead of the entire dataset at once, improving memory efficiency and potentially speeding up training.
    - **Gradient accumulation:** Accumulate gradients over multiple mini-batches before updating weights. This simulates a larger batch size without the memory cost of holding the larger batch at once, and reduces the number of parameter updates.
    - **Parallelization:** Train the network on multiple GPUs or TPUs to distribute the computational load and accelerate training.
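
Batching and gradient accumulation can be sketched with a one-parameter model y = w * x and a mean squared-error loss. The model, the data, and every name below are assumptions for illustration. With equally sized mini-batches, accumulating two batches of two examples yields the same update as one batch of four.

```python
def minibatches(data, batch_size):
    """Yield consecutive slices of data with at most batch_size items each."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

def mean_gradient(w, batch):
    """Gradient of the mean of 0.5 * (w*x - t)**2 over (x, t) pairs."""
    return sum((w * x - t) * x for x, t in batch) / len(batch)

def train_epoch(w, data, batch_size, lr, accumulate=1):
    """One pass over data, updating w once every `accumulate` mini-batches."""
    acc, count = 0.0, 0
    for batch in minibatches(data, batch_size):
        acc += mean_gradient(w, batch)
        count += 1
        if count == accumulate:
            w -= lr * acc / count         # apply the averaged accumulated gradient
            acc, count = 0.0, 0
    if count:                             # leftover batches at the end of the epoch
        w -= lr * acc / count
    return w

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # targets follow t = 2x
w = 0.0
for _ in range(50):
    w = train_epoch(w, data, batch_size=2, lr=0.05)
print(w)  # approaches 2.0
```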