1. Explain the concept of forward propagation in a neural network

**Forward Propagation in a Neural Network**

Forward propagation is the process by which input data passes through a neural network, layer by layer, to produce an output. It is the first phase of every training step and the only phase needed for prediction. Each neuron's output is computed from its inputs, its weights, and its bias.

**Step-by-Step Process**

Let’s say a neural network has:

- An input layer
- One or more hidden layers
- An output layer

Each layer consists of neurons (units). Each connection between two neurons has a weight, and each neuron usually has its own bias.

**Mathematical Formulation**

For each layer $l$ (all of its neurons at once, in vector form):

1. Weighted Sum (Linear Transformation):

$$z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$$

- $W^{[l]}$: weights of layer $l$
- $a^{[l-1]}$: activations (outputs) from the previous layer, where $a^{[0]}$ is the input
- $b^{[l]}$: bias of layer $l$

2. Activation Function:

$$a^{[l]} = f(z^{[l]})$$

- $f$: non-linear activation (e.g., ReLU, sigmoid, tanh), applied element-wise

You repeat this from the input layer through the hidden layers to the output layer.
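
The two equations above can be sketched as a short loop over layers. This is a minimal illustration, not a reference implementation: the function name `forward`, the column-vector convention, and the hand-picked 2-3-1 network are all assumptions made for the example.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(x, params, activations):
    """Run a forward pass and return the activations of every layer.

    x           : column vector of shape (n_features, 1)
    params      : list of (W, b) pairs, one per layer
    activations : list of activation functions, one per layer
    """
    a = x
    history = [a]  # a^[0] is the input itself
    for (W, b), f in zip(params, activations):
        z = W @ a + b   # weighted sum: z^[l] = W^[l] a^[l-1] + b^[l]
        a = f(z)        # activation:   a^[l] = f(z^[l])
        history.append(a)
    return history

# Hypothetical 2-3-1 network with hand-picked parameters.
params = [
    (np.array([[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]), np.zeros((3, 1))),
    (np.array([[0.7, -0.1, 0.2]]), np.array([[0.05]])),
]
x = np.array([[1.0], [2.0]])

layers = forward(x, params, [relu, sigmoid])
y_hat = layers[-1]
print("hidden activations:\n", layers[1])
print("prediction:\n", y_hat)
```

Returning every layer's activation, rather than only the final one, is a deliberate choice here: the backward pass needs those intermediate values.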

**Example (Simple Network)**

Input Layer → Hidden Layer (ReLU) → Output Layer (Sigmoid)

Given input:

$$X = [x_1, x_2]$$

Hidden Layer:

$$z^{[1]} = W^{[1]} X + b^{[1]}$$

$$a^{[1]} = \text{ReLU}(z^{[1]})$$

Output Layer:

$$z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}$$

$$\hat{y} = \text{Sigmoid}(z^{[2]})$$

**Purpose of Forward Propagation**

- In training: to calculate the predicted output $\hat{y}$, from which the loss (how far the prediction is from the true output) is computed.
- In inference: to make predictions using the learned weights.



2. What is the purpose of the activation function in forward propagation?

The activation function is a crucial component in forward propagation of a neural network. It introduces non-linearity into the network, allowing it to learn complex patterns and make meaningful predictions.

**Why Activation Functions Are Important**

1. Introduce Non-Linearity

Without activation functions, each layer in a neural network would just perform linear transformations.

A stack of linear layers is still just a linear function, regardless of depth.

Real-world data is often non-linear, so we need activation functions to model non-linear relationships.

2. Allow Learning of Complex Representations

Non-linear activation functions allow the network to learn:

- Interactions between features
- Hierarchical patterns (especially in deep networks like CNNs, RNNs)

3. Enable Deep Learning

Deep networks rely on non-linear activations to create layer-wise abstractions.

Each hidden layer transforms the data into a more meaningful representation.
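
The claim that a stack of linear layers is still a single linear function can be checked numerically. The sketch below uses hand-picked, purely hypothetical parameters for a 2-2-1 network: it merges two activation-free layers into one, then shows that putting a ReLU between them changes the result.

```python
import numpy as np

# Hand-picked (hypothetical) parameters for a 2-2-1 network.
W1 = np.array([[1.0, -1.0], [2.0, 0.5]])
b1 = np.array([[0.0], [-1.0]])
W2 = np.array([[1.0, 1.0]])
b2 = np.array([[0.5]])
x = np.array([[1.0], [3.0]])

# Two layers with no activation in between.
two_linear = W2 @ (W1 @ x + b1) + b2

# The same map written as ONE linear layer.
W_merged = W2 @ W1
b_merged = W2 @ b1 + b2
one_linear = W_merged @ x + b_merged

print(two_linear, one_linear)  # both are [[1.]]

# With a ReLU between the layers, the network is no longer a single linear map.
with_relu = W2 @ np.maximum(0, W1 @ x + b1) + b2
print(with_relu)  # [[3.]]
```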



**Common Activation Functions**

| Activation  | Formula                             | Characteristics                                                       |
| ----------- | ----------------------------------- | --------------------------------------------------------------------- |
| **ReLU**    | $\max(0, x)$                        | Cheap to compute; mitigates vanishing gradients for positive inputs   |
| **Sigmoid** | $\frac{1}{1 + e^{-x}}$              | Outputs in (0,1); good for binary classification                      |
| **Tanh**    | $\frac{e^x - e^{-x}}{e^x + e^{-x}}$ | Outputs in (-1,1); zero-centered                                      |
| **Softmax** | $\frac{e^{x_i}}{\sum_j e^{x_j}}$    | Converts scores to probabilities (used in multi-class classification) |
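
The formulas in the table translate almost directly into NumPy. This is a minimal sketch: the function names are chosen for illustration, and the max-subtraction in `softmax` is a common numerical-stability trick that is not part of the formula in the table. It leaves the result unchanged mathematically.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def softmax(x):
    # Expects a 1-D vector of scores.
    # Subtracting the max avoids overflow in exp without changing the output.
    exps = np.exp(x - np.max(x))
    return exps / np.sum(exps)

scores = np.array([-1.0, 0.0, 2.0])  # hypothetical pre-activations
print("relu   :", relu(scores))
print("sigmoid:", sigmoid(scores))
print("tanh   :", tanh(scores))
print("softmax:", softmax(scores))
```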


**Example**

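Suppose:

$$z = W \cdot x + b = 2.5$$

1. Linear Output (without activation):

$$a = z = 2.5$$

2. With ReLU:

$$a = \max(0, 2.5) = 2.5$$

3. With Sigmoid:

$$a = \frac{1}{1 + e^{-2.5}} \approx 0.92$$

For this positive input, ReLU returns the same value as the linear output; the two differ only when $z < 0$, where ReLU outputs 0. Each activation changes the output and affects how the network learns.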



3.  Describe the steps involved in the backward propagation (backpropagation) algorithm


**Steps Involved in the Backward Propagation (Backpropagation) Algorithm**

Backpropagation is the core algorithm used to train neural networks. It computes the gradients of the loss function with respect to the weights and biases; an optimizer such as gradient descent then uses those gradients to update the parameters and minimize the loss.

**Overall Training Loop:**

1. Forward Propagation → Compute predictions

2. Compute Loss → Measure prediction error

3. Backward Propagation → Compute gradients

4. Update Weights → Use gradient descent
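
The four steps of the loop can be seen in the smallest possible setting: a single sigmoid neuron trained with gradient descent. Everything specific in this sketch is an assumption made for illustration: the toy data (logical OR), the binary cross-entropy loss, the learning rate, and the number of epochs.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical toy data: 4 samples, 2 features, binary labels (logical OR).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([[0.0], [1.0], [1.0], [1.0]])

w = np.zeros((2, 1))
b = 0.0
alpha = 0.5   # learning rate (assumed value)
losses = []

for epoch in range(200):
    # 1. Forward propagation: compute predictions
    y_hat = sigmoid(X @ w + b)

    # 2. Compute loss (binary cross-entropy, averaged over samples)
    loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    losses.append(loss)

    # 3. Backward propagation: for sigmoid + cross-entropy, dL/dz = y_hat - y
    dz = (y_hat - y) / len(X)
    dw = X.T @ dz
    db = np.sum(dz)

    # 4. Update weights with gradient descent
    w -= alpha * dw
    b -= alpha * db

print("first loss:", losses[0])
print("last loss :", losses[-1])
```

A multi-layer network follows the same loop; only step 3 becomes more involved, because the error has to be propagated through each hidden layer.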

**Steps in Backpropagation (Layer by Layer)**

Let’s assume a neural network with L layers.

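Notation:

- $a^{[l]}$: activation/output of layer $l$
- $z^{[l]}$: pre-activation ($z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$)
- $W^{[l]}, b^{[l]}$: weights and biases of layer $l$
- $\hat{y}$: predicted output
- $y$: true label
- $L$: loss function (e.g., MSE, cross-entropy); when it appears as a bracketed superscript, $[L]$ refers to the last layer
- $f^{[l]}$: activation function of layer $l$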


**Step-by-Step Breakdown**

Step 1: Compute Loss

Calculate the loss between the predicted output $\hat{y}$ and the true value $y$:

$$L = \text{Loss}(\hat{y}, y)$$

Step 2: Compute the Error at the Output Layer

For the output layer $L$:

$$\delta^{[L]} = \frac{\partial L}{\partial a^{[L]}} \odot f^{[L]\prime}(z^{[L]})$$

This is the gradient of the loss with respect to the weighted input $z^{[L]}$ at the output layer. Here $\odot$ denotes element-wise multiplication.

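Step 3: Backpropagate Error Through Layers

For each layer $l$ from $L$ down to 1:

1. Compute gradient of loss w.r.t. weights:

$$\frac{\partial L}{\partial W^{[l]}} = \delta^{[l]} (a^{[l-1]})^T$$

2. Compute gradient of loss w.r.t. biases:

$$\frac{\partial L}{\partial b^{[l]}} = \delta^{[l]}$$

3. Propagate error to previous layer (only needed while $l > 1$):

$$\delta^{[l-1]} = \left( (W^{[l]})^T \delta^{[l]} \right) \odot f^{[l-1]\prime}(z^{[l-1]})$$

Here $\odot$ denotes element-wise multiplication; the other products are matrix products.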

Step 4: Update Parameters (Gradient Descent)

$$W^{[l]} := W^{[l]} - \alpha \frac{\partial L}{\partial W^{[l]}}$$

$$b^{[l]} := b^{[l]} - \alpha \frac{\partial L}{\partial b^{[l]}}$$

$\alpha$ is the learning rate.
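
Steps 1 to 4 can be sketched for a network of arbitrary depth and a single training sample. This is an illustrative sketch under stated assumptions, not a definitive implementation: the function name `backprop_step`, the squared-error loss $\frac{1}{2}\lVert a^{[L]} - y \rVert^2$, the column-vector convention, and the example 2-2-1 network are all chosen for the example.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

def relu(z):
    return np.maximum(0, z)

def relu_prime(z):
    return (z > 0).astype(float)

def backprop_step(x, y, Ws, bs, fs, f_primes, alpha):
    """One forward/backward pass plus a gradient-descent update (single sample)."""
    # Forward pass, caching z^[l] and a^[l] for the backward pass.
    a, As, Zs = x, [x], []
    for W, b, f in zip(Ws, bs, fs):
        z = W @ a + b
        a = f(z)
        Zs.append(z)
        As.append(a)

    # Steps 1-2: loss and output-layer error (assumed loss: 0.5 * ||a - y||^2).
    loss = 0.5 * np.sum((a - y) ** 2)
    delta = (a - y) * f_primes[-1](Zs[-1])

    # Step 3: walk backward through the layers, collecting gradients.
    # Ws[l] holds the weights of layer l+1, and As[l] is that layer's input.
    grads_W, grads_b = [], []
    for l in reversed(range(len(Ws))):
        grads_W.insert(0, delta @ As[l].T)  # dL/dW = delta (a_prev)^T
        grads_b.insert(0, delta)            # dL/db = delta
        if l > 0:
            delta = (Ws[l].T @ delta) * f_primes[l - 1](Zs[l - 1])

    # Step 4: gradient-descent update.
    new_Ws = [W - alpha * dW for W, dW in zip(Ws, grads_W)]
    new_bs = [b - alpha * db for b, db in zip(bs, grads_b)]
    return loss, new_Ws, new_bs

# Hypothetical 2-2-1 network (ReLU hidden layer, sigmoid output) and one sample.
Ws = [np.array([[0.1, 0.2], [0.3, -0.4]]), np.array([[0.5, -0.6]])]
bs = [np.array([[0.1], [0.1]]), np.array([[0.0]])]
x, y = np.array([[0.5], [-0.2]]), np.array([[1.0]])
fs, f_primes = [relu, sigmoid], [relu_prime, sigmoid_prime]

loss_before, Ws, bs = backprop_step(x, y, Ws, bs, fs, f_primes, alpha=0.1)
loss_after, _, _ = backprop_step(x, y, Ws, bs, fs, f_primes, alpha=0.1)
print(loss_before, loss_after)  # the second value should be slightly smaller
```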

**Visual Overview**

Forward Propagation →
    Input → Hidden Layers → Output → Compute Loss

Backward Propagation ←

    Output Layer: δ

    Hidden Layers: δ propagates backward

    Compute Gradients: ∂Loss/∂W, ∂Loss/∂b
    
    Update: W := W - α * ∂Loss/∂W


**Example (Simplified)**

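Assume one hidden layer, sigmoid activations in both layers, and the squared-error loss $L = \frac{1}{2}(\hat{y} - y)^2$:

1. $z^{[1]} = W^{[1]} x + b^{[1]}, \quad a^{[1]} = \sigma(z^{[1]})$

2. $z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}, \quad \hat{y} = \sigma(z^{[2]})$

Then during backprop:

Compute $\delta^{[2]} = (\hat{y} - y) \odot \sigma'(z^{[2]})$

Then $\delta^{[1]} = \left( (W^{[2]})^T \delta^{[2]} \right) \odot \sigma'(z^{[1]})$

Finally update weights and biases.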
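
One way to gain confidence in these formulas is to compare an analytic gradient with a finite-difference estimate. The sketch below does this for a single weight. It assumes the squared-error loss $\frac{1}{2}(\hat{y} - y)^2$, and the parameter values, the helper `loss_fn`, and the choice of weight to check are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss_fn(W1, b1, W2, b2, x, y):
    a1 = sigmoid(W1 @ x + b1)
    y_hat = sigmoid(W2 @ a1 + b2)
    return 0.5 * np.sum((y_hat - y) ** 2)

# Hypothetical 2-2-1 network and a single training sample.
W1 = np.array([[0.15, 0.20], [0.25, 0.30]])
b1 = np.array([[0.35], [0.35]])
W2 = np.array([[0.40, 0.45]])
b2 = np.array([[0.60]])
x = np.array([[0.05], [0.10]])
y = np.array([[0.01]])

# Forward pass
z1 = W1 @ x + b1
a1 = sigmoid(z1)
z2 = W2 @ a1 + b2
y_hat = sigmoid(z2)

# Backward pass, using sigma'(z) = sigma(z) * (1 - sigma(z))
delta2 = (y_hat - y) * y_hat * (1 - y_hat)
delta1 = (W2.T @ delta2) * a1 * (1 - a1)
dW2, db2 = delta2 @ a1.T, delta2
dW1, db1 = delta1 @ x.T, delta1

# Central finite-difference estimate of dL/dW1[0, 1]
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 1] += eps
W1_minus[0, 1] -= eps
numeric = (loss_fn(W1_plus, b1, W2, b2, x, y)
           - loss_fn(W1_minus, b1, W2, b2, x, y)) / (2 * eps)

print(np.isclose(dW1[0, 1], numeric))  # True
```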

**Summary Table**

| Step | Description                              |
| ---- | ---------------------------------------- |
| 1    | Compute prediction using forward pass    |
| 2    | Calculate loss                           |
| 3    | Compute error at output layer            |
| 4    | Propagate error backward                 |
| 5    | Compute gradients for weights and biases |
| 6    | Update weights using gradient descent    |


4. What is the purpose of the chain rule in backpropagation?

**Purpose of the Chain Rule in Backpropagation**

The chain rule is the mathematical foundation of backpropagation. It allows us to efficiently compute gradients of the loss function with respect to each weight and bias in a multi-layer neural network.

**Why Is the Chain Rule Needed?**

In neural networks, outputs depend on multiple layers of nested functions. To train the network, we need to know how a change in each parameter affects the loss.

Backpropagation uses the chain rule from calculus to break down this complex dependency into smaller, manageable steps.

**What Is the Chain Rule?**

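If a variable $z$ depends on $y$, which depends on $x$, then:

$$\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}$$

In a neural network, the loss $L$ depends on the output, which depends on the activations, which in turn depend on the weights, biases, and inputs.

So to compute $\partial L / \partial W$, we apply:

$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial W}$$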
 
**How the Chain Rule Works in Backpropagation**

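Let’s say:

$$z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$$

$$a^{[l]} = f(z^{[l]})$$

$$L = \text{Loss}(a^{[L]}, y)$$

To find the gradient $\partial L / \partial W^{[l]}$, we apply the chain rule:

$$\frac{\partial L}{\partial W^{[l]}} = \frac{\partial L}{\partial a^{[l]}} \cdot \frac{\partial a^{[l]}}{\partial z^{[l]}} \cdot \frac{\partial z^{[l]}}{\partial W^{[l]}}$$

Each term is easier to compute than the full derivative. The first factor, $\partial L / \partial a^{[l]}$, is itself obtained by the chain rule from layer $l+1$, so the process is repeated layer by layer, from the output back to the input.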
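
For a single scalar neuron the three factors can be computed one by one and multiplied. The sketch below assumes a sigmoid activation and the loss $\frac{1}{2}(a - y)^2$; the numeric values are hypothetical. It then checks the product against a finite-difference estimate.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical scalar neuron: z = w*x + b, a = sigmoid(z), L = 0.5*(a - y)^2
w, b, x, y = 0.8, -0.3, 1.5, 1.0

z = w * x + b
a = sigmoid(z)

dL_da = a - y          # derivative of 0.5*(a - y)^2 with respect to a
da_dz = a * (1 - a)    # derivative of the sigmoid with respect to z
dz_dw = x              # derivative of w*x + b with respect to w

dL_dw = dL_da * da_dz * dz_dw   # chain rule: multiply the three factors

def loss(w_val):
    return 0.5 * (sigmoid(w_val * x + b) - y) ** 2

# Central finite-difference estimate of the same derivative
eps = 1e-6
numeric = (loss(w + eps) - loss(w - eps)) / (2 * eps)

print(abs(dL_dw - numeric) < 1e-8)  # True
```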

**Why It’s So Powerful**

Neural networks can have millions of parameters and deep architectures.

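Because each layer's gradient is expressed in terms of the error δ of the layer after it, the backward pass reuses intermediate results instead of recomputing them for every parameter. This is what makes training efficient and scalable.

It forms the basis of reverse-mode automatic differentiation, which is used in PyTorch, TensorFlow, etc.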



5. Implement the forward propagation process for a simple neural network with one hidden layer using NumPy.

**Network Architecture:**

- Input layer: 2 features
- Hidden layer: 2 neurons (activation: ReLU)
- Output layer: 1 neuron (activation: Sigmoid)

In [1]:
import numpy as np

# Activation functions
def relu(z):
    return np.maximum(0,z)

def sigmoid(z):
    return 1/(1 + np.exp(-z))

# Input data (1 sample, 2 features)
X = np.array([[0.5,-0.2]])

# Weights and biases initialization
np.random.seed(42)
W1 = np.random.randn(2,2) #(input_size, hidden_size)
b1 = np.random.randn(1,2) #(1,hidden_size)

W2 = np.random.randn(2,1) # (hidden_size, output_size)
b2 = np.random.randn(1,1) # (1, output_size)

# Forward propagation
# Layer 1 (hidden layer)
Z1 = np.dot(X,W1) + b1 # Linear step
A1 = relu(Z1) # Activation (ReLU)


# Layer 2 (output layer)
Z2 = np.dot(A1,W2) + b2 # Linear step
A2 = sigmoid(Z2) # Activation (Sigmoid)

# Print results
print("Input X:\n",X)
print("Z1 (input to hidden):\n",Z1)
print("A1 (hidden activation - ReLU):\n",A1)
print("Z2 (input to output):\n",Z2)
print("A2 (output - Sigmoid):\n",A2)

Input X:
 [[ 0.5 -0.2]]
Z1 (input to hidden):
 [[-0.11533401 -0.60787508]]
A1 (hidden activation - ReLU):
 [[0. 0.]]
Z2 (input to output):
 [[-0.46947439]]
A2 (output - Sigmoid):
 [[0.38474066]]
