# Backpropagation and Neural Networks

Backpropagation, short for "backward propagation of errors," is the key algorithm that powers the training of artificial neural networks. It computes the gradient of the loss with respect to every weight in a single backward sweep through the network, which is what makes gradient-based training of deep networks computationally practical.

The core idea behind backpropagation is to compute how much each network parameter (such as a weight or bias) contributes to the prediction error and to then adjust these parameters to minimize the overall error. This process relies on the chain rule of calculus to propagate the error backwards through the network, starting from the output layer and moving towards the input layer.

Backpropagation is used together with gradient descent to update the weights iteratively. Training alternates between two key phases: **the forward pass**, where the input data is passed through the network to generate predictions, and **the backward pass**, where the error is calculated and gradients are computed to update the weights. By reducing the error a little with each epoch, the network "learns" parameters that fit the training data.

![](https://miro.medium.com/v2/resize:fit:1080/0*d9yJ5xIqdbDyjCYR.gif)

## Backpropagation and The Chain Rule

A key mathematical concept underlying backpropagation is the **chain rule**, which is used to compute derivatives of composite functions. Since a neural network is essentially a nested composition of functions — where each layer applies an activation function to a linear combination of inputs — the chain rule enables us to compute how changes in a given parameter (like a weight or bias) affect the final network output.

The chain rule states that if we have a composite function $f(x) = g(h(x))$, then the derivative of $f$ with respect to $x$ is given by:  
$$
\frac{df}{dx} = \frac{dg}{dh} \cdot \frac{dh}{dx}
$$  

In the context of backpropagation, this means that we can compute the gradient of the overall loss $L$ with respect to the weights $w$ by sequentially propagating gradients backward through the network. For example, if $L$ depends on the output of layer $k$ and layer $k$ depends on layer $k-1$, the chain rule tells us that:  

$$
\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a_k} \cdot \frac{\partial a_k}{\partial z_k} \cdot \frac{\partial z_k}{\partial w}
$$  

where:
- $a_k$ is the activation of the layer,
- $z_k$ is the linear combination of inputs to that layer, and
- $w$ represents the weights connecting to the layer.

By applying the chain rule iteratively from the output layer back to the input layer, backpropagation efficiently computes the gradients needed to update all the parameters in the network. This is the key to optimizing deep networks through gradient descent.
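
As a minimal sketch of this product of three factors, consider a single sigmoid neuron with the squared-error loss $L = \frac{1}{2}(a - y)^2$. The input, weight, bias, and target values below are illustrative assumptions and are not taken from this notebook.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative values (assumptions chosen only for this sketch)
x, w, b, y = 0.5, 0.4, 0.1, 1.0

# Forward pass for one neuron
z = w * x + b          # linear combination of the input
a = sigmoid(z)         # activation

# The three chain-rule factors
dL_da = a - y          # dL/da for L = 0.5 * (a - y)**2
da_dz = a * (1.0 - a)  # derivative of the sigmoid at z
dz_dw = x              # dz/dw for z = w * x + b

# Their product is the gradient of the loss with respect to the weight
dL_dw = dL_da * da_dz * dz_dw
print("dL/dw =", dL_dw)
```

The gradient is negative here, so a gradient descent step would increase $w$ and push the activation toward the target.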


## Simple Backpropagation Example

### Network Setup

Let’s consider a simple neural network with:
- 1 input layer (2 features),
- 1 hidden layer (2 neurons), and
- 1 output neuron.

We will manually walk through the **forward pass** and **backpropagation** using the following data:

**Inputs:**  
$$
X = \begin{bmatrix} 0.5 \\ 0.2 \end{bmatrix}
$$

**True output:** $ y = 1 $

**Weights and biases:**  
$$
W_1 = \begin{bmatrix} 0.4 & 0.1 \\ 0.3 & 0.7 \end{bmatrix}, \quad b_1 = \begin{bmatrix} 0.1 \\ 0.2 \end{bmatrix}
$$  

$$
W_2 = \begin{bmatrix} 0.5 & 0.6 \end{bmatrix}, \quad b_2 = 0.3
$$

### Walkthrough:

#### Forward Pass:
Compute the activations $ a_1 $ and output $ a_2 $.

#### Loss Calculation:
Compute the error using a loss function, like **Mean Squared Error**:  
$$
L = \frac{1}{2} (a_2 - y)^2
$$
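
Plugging in the numbers gives the following worked values for the forward pass and the loss. This assumes sigmoid activations in both layers (the walkthrough does not fix an activation; sigmoid matches the code used later in this notebook), and values are rounded to four decimals.

$$
z_1 = W_1 X + b_1 = \begin{bmatrix} 0.4(0.5) + 0.1(0.2) + 0.1 \\ 0.3(0.5) + 0.7(0.2) + 0.2 \end{bmatrix} = \begin{bmatrix} 0.32 \\ 0.49 \end{bmatrix}, \quad a_1 = \sigma(z_1) \approx \begin{bmatrix} 0.5793 \\ 0.6201 \end{bmatrix}
$$

$$
z_2 = W_2 a_1 + b_2 \approx 0.5(0.5793) + 0.6(0.6201) + 0.3 \approx 0.9617, \quad a_2 = \sigma(z_2) \approx 0.7235
$$

$$
L = \frac{1}{2}(0.7235 - 1)^2 \approx 0.0382
$$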

#### Backward Pass:  
Compute gradients using the chain rule, as described in the Backpropagation section below.

#### Weight Updates:
Update the weights using gradient descent, where $ \eta $ is the learning rate:  

$$
W^{(l)} = W^{(l)} - \eta \frac{\partial L}{\partial W^{(l)}}
$$  
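
A minimal sketch of one such update on the network above is shown here. It assumes sigmoid activations (as in the code later in this notebook) and a learning rate of $\eta = 0.1$, which is an arbitrary choice for illustration; the helper `forward` is a name introduced only for this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    """Return the hidden activations and the network output."""
    a1 = sigmoid(W1 @ X + b1)
    a2 = sigmoid(W2 @ a1 + b2)
    return a1, a2

# Data and parameters from the network setup above
X = np.array([[0.5], [0.2]])
y = np.array([[1.0]])
W1 = np.array([[0.4, 0.1], [0.3, 0.7]])
b1 = np.array([[0.1], [0.2]])
W2 = np.array([[0.5, 0.6]])
b2 = np.array([[0.3]])
eta = 0.1  # learning rate (assumed value)

# Forward pass and loss before the update
a1, a2 = forward(X, W1, b1, W2, b2)
loss_before = 0.5 * ((a2 - y) ** 2).item()

# Backward pass (sigmoid derivative written as a * (1 - a))
delta2 = (a2 - y) * a2 * (1 - a2)
delta1 = (W2.T @ delta2) * a1 * (1 - a1)

# Gradient descent step: W <- W - eta * dL/dW
W2 = W2 - eta * (delta2 @ a1.T)
b2 = b2 - eta * delta2
W1 = W1 - eta * (delta1 @ X.T)
b1 = b1 - eta * delta1

# Loss after the update
_, a2_new = forward(X, W1, b1, W2, b2)
loss_after = 0.5 * ((a2_new - y) ** 2).item()
print(f"loss before: {loss_before:.6f}, loss after: {loss_after:.6f}")
```

Note that `delta1` is computed with the old `W2`, before any parameter is changed; all gradients in a step refer to the same set of weights.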




## The Forward Pass

In a neural network, the **forward pass** is the process of passing the input data through the network to produce an output. The input is propagated layer by layer through **linear transformations** and **non-linear activation functions** until a final prediction is obtained.

At each layer $ l $, the input is transformed using the following steps:

---

#### 1. Linear Transformation:
The inputs are multiplied by weights and added to biases:  
$$
z^{(l)} = W^{(l)} \cdot a^{(l-1)} + b^{(l)}
$$  

where:  
- $ z^{(l)} $ is the pre-activation value of layer $ l $,  
- $ W^{(l)} $ is the weight matrix for layer $ l $,  
- $ a^{(l-1)} $ is the activation from the previous layer, and  
- $ b^{(l)} $ is the bias vector for layer $ l $.

---

#### 2. Activation Function:
A non-linear activation function is applied to the pre-activation $ z^{(l)} $ to get the activation of the current layer:  
$$
a^{(l)} = \sigma(z^{(l)})
$$  

Common activation functions include sigmoid, ReLU, and tanh.


In [1]:
import numpy as np

# Define the sigmoid activation function
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Input data (2 features, 1 sample)
X = np.array([[0.5], [0.2]])

# Initialize weights and biases for a 2-layer neural network
W1 = np.array([[0.4, 0.1], [0.3, 0.7]])  # Weights for the hidden layer
b1 = np.array([[0.1], [0.2]])            # Biases for the hidden layer

W2 = np.array([[0.5, 0.6]])              # Weights for the output layer
b2 = np.array([[0.3]])                   # Bias for the output layer

# Forward pass
z1 = np.dot(W1, X) + b1  # Linear transformation for the hidden layer
a1 = sigmoid(z1)         # Activation for the hidden layer

z2 = np.dot(W2, a1) + b2  # Linear transformation for the output layer
a2 = sigmoid(z2)          # Activation for the output layer (final prediction)

print("Output of the neural network:", a2)

Output of the neural network: [[0.72346724]]


## Backpropagation

The goal of backpropagation is to calculate the **gradient of the loss function** with respect to the network’s weights and biases, so that these parameters can be updated using gradient descent. The key to backpropagation is the **chain rule**, which allows us to propagate the error backward through the network.


1. **Compute the Loss:**  
    Assume a loss function $L$ (e.g., Mean Squared Error or Cross-Entropy Loss).

2. **Compute Output Error:**  
    Compute the error signal at the output layer, i.e. the derivative of the loss with respect to the output pre-activation $z^{(L)}$:

    $$
    \delta^{(L)} = \frac{\partial L}{\partial a^{(L)}} \odot \sigma'(z^{(L)})
    $$

    Here, $\delta^{(L)}$ is the error signal at the output layer, $\sigma'(z)$ is the derivative of the activation function, and $\odot$ denotes element-wise multiplication. The superscript $(L)$ indexes the last layer and should not be confused with the loss $L$.

3. **Backpropagate Error:**  
    For each previous layer $l$, propagate the error backward:

    $$
    \delta^{(l)} = \Big( \big( W^{(l+1)} \big)^T \delta^{(l+1)} \Big) \odot \sigma'(z^{(l)})
    $$

4. **Calculate Gradients:**  
    The gradients of the loss with respect to the weights and biases are:

    $$
    \frac{\partial L}{\partial W^{(l)}} = \delta^{(l)} \cdot \big( a^{(l-1)} \big)^T \quad \text{and} \quad \frac{\partial L}{\partial b^{(l)}} = \delta^{(l)}
    $$


In [2]:
# Define the derivative of the sigmoid function
def sigmoid_derivative(z):
    return sigmoid(z) * (1 - sigmoid(z))

# True label for the single training sample
y_true = np.array([[1]])
loss_derivative = a2 - y_true  # dL/da2 for the loss L = 0.5 * (a2 - y)^2

# Backpropagation
delta2 = loss_derivative * sigmoid_derivative(z2)  # Output layer error
dW2 = np.dot(delta2, a1.T)  # Gradient for W2
db2 = delta2  # Gradient for b2

delta1 = np.dot(W2.T, delta2) * sigmoid_derivative(z1)  # Hidden layer error
dW1 = np.dot(delta1, X.T)  # Gradient for W1
db1 = delta1  # Gradient for b1

print("Gradients for W1:", dW1)
print("Gradients for W2:", dW2)

Gradients for W1: [[-0.00337071 -0.00134828]
 [-0.00390986 -0.00156394]]
Gradients for W2: [[-0.03205042 -0.03430665]]
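
A common sanity check for hand-written backpropagation is to compare the analytic gradients against central finite differences. The sketch below repeats the setup so that it runs on its own; the helper name `loss_for` and the step size `eps` are choices made for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Same data and parameters as in the cells above
X = np.array([[0.5], [0.2]])
y_true = np.array([[1.0]])
W1 = np.array([[0.4, 0.1], [0.3, 0.7]])
b1 = np.array([[0.1], [0.2]])
W2 = np.array([[0.5, 0.6]])
b2 = np.array([[0.3]])

def loss_for(W1_candidate):
    """Loss 0.5 * (a2 - y)^2 as a function of the hidden-layer weights."""
    a1 = sigmoid(W1_candidate @ X + b1)
    a2 = sigmoid(W2 @ a1 + b2)
    return 0.5 * ((a2 - y_true) ** 2).item()

# Analytic gradient via backpropagation
a1 = sigmoid(W1 @ X + b1)
a2 = sigmoid(W2 @ a1 + b2)
delta2 = (a2 - y_true) * a2 * (1 - a2)
delta1 = (W2.T @ delta2) * a1 * (1 - a1)
dW1 = delta1 @ X.T

# Numerical gradient via central differences, one weight at a time
eps = 1e-6
dW1_numeric = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        shift = np.zeros_like(W1)
        shift[i, j] = eps
        dW1_numeric[i, j] = (loss_for(W1 + shift) - loss_for(W1 - shift)) / (2 * eps)

print("max abs difference:", np.abs(dW1 - dW1_numeric).max())
```

If the two disagree by more than a small tolerance, the usual suspects are a missing activation derivative or a transposed matrix in the backward pass.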


## Modern Code Example

In [25]:
import tensorflow as tf
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

#### Prepare The Data

In [26]:
# Load and preprocess the Iris dataset
iris = load_iris()
X = iris.data  # Features (4 features per sample)
y = (iris.target != 0).astype(int)  # Binary classification: class 0 vs. others

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Normalize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

#### Build And Initialize The Neural Network

In [35]:
# Build a simple neural network using TensorFlow
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation='relu', input_shape=(4,)),  # Hidden layer with 10 neurons
    tf.keras.layers.Dense(1, activation='sigmoid')  # Output layer for binary classification
])

# Initialize optimizer and loss function
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)
loss_fn = tf.keras.losses.BinaryCrossentropy()

#### Training The Model

In [36]:
# Training loop with manual backpropagation using tf.GradientTape
epochs = 1000

for epoch in range(epochs):
    with tf.GradientTape() as tape:
        # Forward pass: compute predictions
        predictions = model(X_train, training=True)
        # Compute the loss
        loss = loss_fn(y_train, predictions)
    
    # Compute gradients (backpropagation)
    gradients = tape.gradient(loss, model.trainable_variables)
    
    # Apply gradients using the optimizer
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    
    # Print the loss every 10 epochs
    if epoch % 10 == 0:
        print(f"Epoch {epoch}, Loss: {loss.numpy()}")

Epoch 0, Loss: 0.5722711086273193
Epoch 10, Loss: 0.35702794790267944
Epoch 20, Loss: 0.2475539594888687
Epoch 30, Loss: 0.1799178123474121
Epoch 40, Loss: 0.13464634120464325
Epoch 50, Loss: 0.10376274585723877
Epoch 60, Loss: 0.08214350044727325
Epoch 70, Loss: 0.06658688187599182
Epoch 80, Loss: 0.05512924864888191
Epoch 90, Loss: 0.04649712145328522
Epoch 100, Loss: 0.03983476012945175
Epoch 110, Loss: 0.034591659903526306
Epoch 120, Loss: 0.030396215617656708
Epoch 130, Loss: 0.026984866708517075
Epoch 140, Loss: 0.024172551929950714
Epoch 150, Loss: 0.021833669394254684
Epoch 160, Loss: 0.019858192652463913
Epoch 170, Loss: 0.01821255311369896
Epoch 180, Loss: 0.016815833747386932
Epoch 190, Loss: 0.01560197863727808
Epoch 200, Loss: 0.014538545161485672
Epoch 210, Loss: 0.013600770384073257
Epoch 220, Loss: 0.012769242748618126
Epoch 230, Loss: 0.012026762589812279
Epoch 240, Loss: 0.011360087431967258
Epoch 250, Loss: 0.010758980177342892
Epoch 260, Loss: 0.010214472189545631
...

#### Evaluate The Model

In [37]:
# Evaluate the model
test_loss = loss_fn(y_test, model(X_test)).numpy()
print(f"\nFinal test loss: {test_loss}")


Final test loss: 0.0016189967282116413


In [38]:
# Predict on test data and compute accuracy
predictions = model.predict(X_test)
predicted_classes = (predictions > 0.5).astype(int)  # Convert probabilities to binary labels (0 or 1)

# Calculate accuracy
accuracy = (predicted_classes.flatten() == y_test).mean()  # Flatten predictions and compare with true labels

print(f"Final test accuracy: {accuracy * 100:.2f}%")

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 142ms/step
Final test accuracy: 100.00%


#### Make Predictions

In [39]:
# Make predictions on test data
predictions = model.predict(X_test)
predicted_classes = (predictions > 0.5).astype(int)  # Convert probabilities to binary labels (0 or 1)

# Display predictions and ground truth
for i in range(5):  # Display the first 5 predictions
    print(f"Sample {i}: Predicted = {predicted_classes[i][0]}, True = {y_test[i]}")

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 65ms/step
Sample 0: Predicted = 1, True = 1
Sample 1: Predicted = 0, True = 0
Sample 2: Predicted = 1, True = 1
Sample 3: Predicted = 1, True = 1
Sample 4: Predicted = 1, True = 1


## Regression Example

In [40]:
import tensorflow as tf
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import numpy as np

#### Prepare The Dataset

In [41]:
# Load and preprocess the Boston Housing dataset.
# load_boston was removed in scikit-learn 1.2, so the data is fetched from the
# original source, as suggested by the scikit-learn deprecation notice.
# The scikit-learn maintainers discourage using this dataset because of an
# ethical problem with it; they list the California housing dataset
# (sklearn.datasets.fetch_california_housing) and the Ames housing dataset
# as alternatives.
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
X = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])  # Features (number of rooms, location, etc.)
y = raw_df.values[1::2, 2].reshape(-1, 1)  # House prices reshaped to 2D

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Normalize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

#### Build And Compile Neural Network

In [42]:
# Build a simple neural network using TensorFlow
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(X_train.shape[1],)),  # Input layer (13 features)
    tf.keras.layers.Dense(64, activation='relu'),  # Hidden layer with 64 neurons
    tf.keras.layers.Dense(1)  # Output layer for regression (1 output)
])

In [43]:
# Compile the model with the optimizer and loss function
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
              loss='mean_squared_error',
              metrics=['mae'])

In [44]:
# Train the model
history = model.fit(X_train, y_train, epochs=100, validation_split=0.1, verbose=1)

Epoch 1/100
12/12 ━━━━━━━━━━━━━━━━━━━━ 1s 24ms/step - loss: 552.1786 - mae: 21.6182 - val_loss: 397.7032 - val_mae: 18.3852
Epoch 2/100
12/12 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - loss: 359.6962 - mae: 16.6409 - val_loss: 168.2371 - val_mae: 11.1015
Epoch 3/100
12/12 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - loss: 141.9568 - mae: 9.1925 - val_loss: 70.9432 - val_mae: 6.0898
Epoch 4/100
12/12 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - loss: 81.3917 - mae: 6.9396 - val_loss: 43.2335 - val_mae: 4.3369
Epoch 5/100
12/12 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - loss: 33.5421 - mae: 4.5052 - val_loss: 47.3297 - val_mae: 4.7876
Epoch 6/100
12/12 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - loss: 30.0625 - mae: 3.8184 - val_loss: 47.7934 - val_mae: 4.7817
Epoch 7/100
12/12 ━━━━━━━━━━━━━━━━━━━━ 0s

#### Evaluate The Model

In [45]:
# Evaluate the model on test data
test_loss, test_mae = model.evaluate(X_test, y_test, verbose=0)
print(f"\nFinal test loss (MSE): {test_loss:.4f}")
print(f"Final test MAE: {test_mae:.4f}")


Final test loss (MSE): 12.3704
Final test MAE: 2.3685


#### Make Predictions

In [46]:
# Make predictions on new data
predictions = model.predict(X_test[:5])  # Predict on the first 5 test samples

# Display predictions and ground truth
for i in range(5):
    print(f"Sample {i + 1}: Predicted price = {predictions[i][0]:.2f}, Actual price = {y_test[i][0]:.2f}")

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 81ms/step
Sample 1: Predicted price = 26.12, Actual price = 23.60
Sample 2: Predicted price = 34.66, Actual price = 32.40
Sample 3: Predicted price = 15.34, Actual price = 13.60
Sample 4: Predicted price = 22.22, Actual price = 22.80
Sample 5: Predicted price = 17.18, Actual price = 16.10


## What is missing here? 

There is no explicit call to backpropagation in the regression example above. Instead, TensorFlow automatically handles both the forward pass and backpropagation during training when `model.fit()` is called.

During the forward pass, TensorFlow propagates the input through the network, computing the output by applying the weights, biases, and activation functions at each layer. 

It then calculates the loss by comparing the predicted output to the actual target using the specified loss function (e.g., mean squared error). 

In the backward pass (backpropagation), TensorFlow computes the gradients of the loss with respect to the model parameters (weights and biases) using automatic differentiation. The optimizer (e.g., Adam) updates the parameters using these gradients to minimize the loss.

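**This entire process is automated and optimized within TensorFlow, which uses the computational graph and `tf.GradientTape` internally.** As a result, you don’t need to compute gradients or update weights manually, because TensorFlow performs these steps under the hood. The cells below rebuild the same model and train it with an explicit `tf.GradientTape` loop so that each step is visible.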

In [54]:
# Build a simple neural network using TensorFlow
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(X_train.shape[1],)),  # Input layer (13 features)
    tf.keras.layers.Dense(64, activation='relu'),  # Hidden layer with 64 neurons
    tf.keras.layers.Dense(1)  # Output layer for regression (1 output)
])

In [55]:
# Compile the model with the optimizer and loss function
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
              loss='mean_squared_error',
              metrics=['mae'])

In [56]:
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
loss_fn = tf.keras.losses.MeanSquaredError()

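# Manual training loop with backpropagation (full batch: one weight update per epoch,
# whereas model.fit above used mini-batches with 12 updates per epoch)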
epochs = 100
for epoch in range(epochs):
    with tf.GradientTape() as tape:
        predictions = model(X_train, training=True)  # Forward pass
        loss = loss_fn(y_train, predictions)  # Compute loss
    
    # Backpropagation: compute gradients
    gradients = tape.gradient(loss, model.trainable_variables)
    
    # Apply gradients to update weights
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    
    if epoch % 10 == 0:
        print(f"Epoch {epoch}, Loss: {loss.numpy()}")

Epoch 0, Loss: 603.97705078125
Epoch 10, Loss: 426.5850830078125
Epoch 20, Loss: 225.91990661621094
Epoch 30, Loss: 92.0041732788086
Epoch 40, Loss: 56.11149597167969
Epoch 50, Loss: 32.889400482177734
Epoch 60, Loss: 26.908130645751953
Epoch 70, Loss: 23.291658401489258
Epoch 80, Loss: 20.429279327392578
Epoch 90, Loss: 18.548315048217773


In [57]:
# Evaluate the model on test data
test_loss, test_mae = model.evaluate(X_test, y_test, verbose=0)
print(f"\nFinal test loss (MSE): {test_loss:.4f}")
print(f"Final test MAE: {test_mae:.4f}")


Final test loss (MSE): 19.0866
Final test MAE: 2.9208


In [58]:
# Make predictions on new data
predictions = model.predict(X_test[:5])  # Predict on the first 5 test samples

# Display predictions and ground truth
for i in range(5):
    print(f"Sample {i + 1}: Predicted price = {predictions[i][0]:.2f}, Actual price = {y_test[i][0]:.2f}")

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 67ms/step
Sample 1: Predicted price = 29.00, Actual price = 23.60
Sample 2: Predicted price = 32.18, Actual price = 32.40
Sample 3: Predicted price = 22.53, Actual price = 13.60
Sample 4: Predicted price = 27.99, Actual price = 22.80
Sample 5: Predicted price = 17.18, Actual price = 16.10


## Vanishing And Exploding Gradients

As neural networks grow deeper (with more layers), training them can become unstable due to challenges such as vanishing gradients and exploding gradients. These challenges directly affect the performance of the network, causing either slow learning or unstable weight updates.

#### Vanishing Gradients

  During backpropagation, gradients are computed using the chain rule and propagated backward from the output layer to earlier layers. As the gradient flows backward, it is multiplied by small derivatives, especially if the network uses activation functions like **sigmoid** or **tanh**. This multiplication causes the gradients to become smaller and smaller as they move toward the input layer, effectively “vanishing.”

- **Why is this a problem?**  
  - Weights in the earlier layers are updated very slowly or not at all, making it difficult for the network to learn meaningful features.  
  - This is particularly problematic for deep networks and recurrent neural networks (RNNs), where gradients may vanish before reaching the initial layers.

- **Explaining the math:**  
  Suppose you have a network with several layers, and the activation function is **sigmoid**. The derivative of the sigmoid function is given by:  
  
  $$
  \sigma'(z) = \sigma(z) \cdot (1 - \sigma(z))
  $$
  
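  The maximum derivative of the sigmoid function is **0.25**, reached at $z = 0$. Because the chain rule multiplies one such factor per layer, the combined contribution of these derivatives is at most $0.25^n$ after $n$ layers, so the gradients shrink exponentially with depth and the early layers receive very small updates.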
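
To make the shrinkage concrete, the sketch below backpropagates through a stack of randomly initialized sigmoid layers and records the norm of the error signal $\delta$ at each layer. The depth, width, weight scale, and random seed are arbitrary assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)   # fixed seed (assumed) for reproducibility
depth, width = 20, 32            # assumed network size

# Forward pass through `depth` sigmoid layers, storing what backprop needs
a = rng.normal(size=(width, 1))
weights, activations = [], []
for _ in range(depth):
    W = rng.normal(scale=1.0 / np.sqrt(width), size=(width, width))
    a = sigmoid(W @ a)
    weights.append(W)
    activations.append(a)

# Backward pass, starting from a gradient of ones at the final activation
grad_a = np.ones((width, 1))
delta_norms = []
for W, act in zip(reversed(weights), reversed(activations)):
    delta = grad_a * act * (1.0 - act)   # multiply by sigma'(z) = a * (1 - a)
    delta_norms.append(np.linalg.norm(delta))
    grad_a = W.T @ delta                 # pass the error to the previous layer
delta_norms.reverse()                    # index 0 is now the first layer

for layer in (0, depth // 2, depth - 1):
    print(f"layer {layer + 1:2d}: ||delta|| = {delta_norms[layer]:.3e}")
print(f"product of sigmoid' maxima: 0.25 ** {depth} = {0.25 ** depth:.3e}")
```

The error signal at the first layer comes out many orders of magnitude smaller than at the last layer, which is why the early weights barely move.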


#### Exploding Gradients

  Exploding gradients are the opposite problem: the gradients grow exponentially large during backpropagation. This typically happens when the weights are initialized with large values, so that each layer amplifies the gradient passing through it, or when the network is otherwise poorly tuned.

- **Why is this a problem?**  
  - Large gradients result in **unstable weight updates**, causing the loss to diverge rather than converge.  
  - The model fails to learn, and training may break down when the loss or weights overflow to `inf` or `NaN`.

- **Instability due to large gradients:**  
  Consider a network using **ReLU** activation where the gradient does not saturate like in sigmoid or tanh. If the gradients of the loss with respect to the weights accumulate too quickly (due to large activations or poorly initialized weights), the updates during gradient descent can become excessively large, resulting in erratic or diverging loss values.
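
The effect of the weight scale can be sketched with a small ReLU network. The function below, `first_to_last_ratio` (a hypothetical helper written for this illustration), measures how much the error signal grows between the last and the first layer; the depth, width, seed, and the factor of 3 are assumptions.

```python
import numpy as np

def first_to_last_ratio(scale, depth=20, width=32, seed=0):
    """Ratio of the error-signal norm at the first layer to that at the last layer."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(width, 1))
    weights, masks = [], []
    for _ in range(depth):
        # `scale` multiplies the He standard deviation sqrt(2 / width)
        W = rng.normal(scale=scale * np.sqrt(2.0 / width), size=(width, width))
        z = W @ a
        a = np.maximum(z, 0.0)                 # ReLU
        weights.append(W)
        masks.append((z > 0).astype(float))    # ReLU derivative: 1 or 0

    # Backward pass, starting from a gradient of ones at the final activation
    grad_a = np.ones((width, 1))
    norms = []
    for W, mask in zip(reversed(weights), reversed(masks)):
        delta = grad_a * mask
        norms.append(np.linalg.norm(delta))
        grad_a = W.T @ delta
    return norms[-1] / norms[0]   # norms[-1] is the first layer, norms[0] the last

ratio_he = first_to_last_ratio(scale=1.0)      # He-scaled weights
ratio_large = first_to_last_ratio(scale=3.0)   # weights three times too large
print(f"He scale:    first/last = {ratio_he:.3e}")
print(f"3x He scale: first/last = {ratio_large:.3e}")
```

Because ReLU does not saturate, tripling every weight multiplies the backward signal by roughly three at each layer, so the growth compounds with depth.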

## Ways Of Correcting Vanishing And Exploding Gradients

### Activation Function Choices

The choice of activation functions can significantly affect how gradients propagate through the network.  

- **Sigmoid and tanh** tend to cause vanishing gradients due to their saturating outputs.  
- **ReLU (Rectified Linear Unit)** and its variants (e.g., **Leaky ReLU**, **Parametric ReLU**) are preferred because they do not saturate for positive inputs, helping alleviate vanishing gradient problems.

**Example:**  
- Using **ReLU**:  
  $$
  f(x) = \max(0, x)
  $$
  
  The derivative is either **1** (for positive inputs) or **0** (for negative inputs). This ensures that the gradient does not decay as rapidly as in sigmoid or tanh.
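
A short comparison of the two derivatives on a few sample inputs (the input values are arbitrary, and the ReLU derivative at exactly zero is set to 0 by convention in this sketch):

```python
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def relu_grad(z):
    # Derivative of max(0, z); the value at z == 0 is taken to be 0 here
    return (z > 0).astype(float)

z = np.array([-4.0, -1.0, 0.5, 2.0, 6.0])
print("sigmoid'(z):", np.round(sigmoid_grad(z), 4))
print("relu'(z):   ", relu_grad(z))
```

For large positive inputs the sigmoid derivative is close to zero, while the ReLU derivative stays at 1.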

### Batch Normalization

Batch normalization normalizes the inputs to each layer, ensuring that the values lie within a stable range during training.  

- **How it helps:** By normalizing the inputs, batch normalization helps keep the gradients from becoming too small or too large.  
- **Where it’s applied:** Typically applied before or after the activation function within each layer.
- **Effect of batch normalization:**  
    - Reduces internal covariate shift (where input distributions change during training). 
    - Helps accelerate training by making the gradients more stable.


In [None]:
from tensorflow.keras.layers import Dense, BatchNormalization, ReLU
from tensorflow.keras import Sequential

# Adding batch normalization in a neural network
model = Sequential([
    Dense(64, input_shape=(10,)),
    BatchNormalization(),  # Normalizes the outputs of the Dense layer before the activation
    ReLU(),  # Activation function after normalization
    Dense(1)  # Output layer
])

### Weight Initialization Techniques

Proper weight initialization can reduce both vanishing and exploding gradients by ensuring that the initial values of the weights do not lead to excessively large or small activations.

- **Xavier Initialization (Glorot Initialization):**  
  Used for layers with **sigmoid** or **tanh** activations.  
  $$
  W \sim \mathcal{U} \left( -\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}, \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}} \right)
  $$

- **He Initialization:**  
  Used for layers with **ReLU** activations. The second argument of $\mathcal{N}$ below is the variance.  
  $$
  W \sim \mathcal{N} \left( 0, \frac{2}{n_{\text{in}}} \right)
  $$
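
Both schemes can be sketched directly in NumPy by sampling from the distributions above and comparing the empirical variance with the target. The layer sizes and the seed are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 256, 128  # layer sizes (assumed for illustration)

# Xavier/Glorot uniform: U(-limit, limit) with limit = sqrt(6 / (n_in + n_out))
limit = np.sqrt(6.0 / (n_in + n_out))
W_xavier = rng.uniform(-limit, limit, size=(n_out, n_in))

# He normal: N(0, 2 / n_in), i.e. standard deviation sqrt(2 / n_in)
W_he = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))

# A uniform distribution on (-limit, limit) has variance limit**2 / 3,
# which here equals 2 / (n_in + n_out)
print("Xavier variance:", W_xavier.var(), "target:", 2.0 / (n_in + n_out))
print("He variance:    ", W_he.var(), "target:", 2.0 / n_in)
```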


In [None]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, ReLU
from tensorflow.keras.initializers import HeNormal

model = Sequential([
    Dense(64, input_shape=(10,), kernel_initializer=HeNormal()),  # He initialization for ReLU
    ReLU(),
    Dense(1)
])

### Residual Connections (ResNets)

Residual networks (ResNets) address vanishing gradients in very deep networks by allowing the gradient to flow more directly through the network. Instead of passing the input through many layers sequentially, **residual connections** skip layers by adding the input to the output of a layer.

**How it helps:**  
- Residual connections help maintain a strong gradient signal even in deep networks by allowing the network to learn **identity mappings**.

**Residual connection formula:**  

Given the input $x$ and a function $F(x)$ representing the transformation of the layer:  

$$
\text{Output} = F(x) + x
$$
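
To see why the skip connection helps the gradient, differentiate the block output with respect to its input. In this short derivation, $h = F(x) + x$ denotes the block output and $I$ the identity matrix; both symbols are notation introduced here.

$$
\frac{\partial h}{\partial x} = \frac{\partial F(x)}{\partial x} + I
\quad \Longrightarrow \quad
\frac{\partial L}{\partial x} = \left( \frac{\partial F(x)}{\partial x} + I \right)^T \frac{\partial L}{\partial h}
$$

Even when $\partial F(x) / \partial x$ is close to zero, the identity term passes $\partial L / \partial h$ back to earlier layers unchanged.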


In [None]:
import tensorflow as tf

class ResidualBlock(tf.keras.Model):
    def __init__(self, units):
        super().__init__()
        self.dense1 = tf.keras.layers.Dense(units, activation='relu')
        self.dense2 = tf.keras.layers.Dense(units)

    def call(self, inputs):
        # F(x): two dense layers. `units` must equal the last dimension of
        # `inputs` so that the addition below is shape-compatible.
        x = self.dense1(inputs)
        x = self.dense2(x)
        return x + inputs  # Residual connection: F(x) + x