
# 1. What is the vanishing gradient problem in deep neural networks? How does it affect training?

Ans :- The **vanishing gradient problem** occurs when the gradients that backpropagation uses to update the weights of a neural network become extremely small as they propagate backward through the layers, especially in deep neural networks. As a result, the weights in the earlier layers (those closest to the input) receive very small updates during training, making it difficult for the model to learn effectively.

### Key Causes:
1. **Activation Functions**: Functions like **Sigmoid** or **Tanh** squash their output into a small range (0-1 for Sigmoid and -1 to 1 for Tanh). When the input to these functions is large (either positive or negative), their gradients approach zero. As the gradients are propagated backward through each layer, they get smaller, causing the learning process to stagnate in the earlier layers.
   
2. **Deep Architectures**: In very deep networks, when the gradients are passed through many layers, they tend to shrink exponentially due to the repeated application of small derivatives.
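
To make the exponential shrinkage concrete, here is a minimal sketch. It assumes the best case for Sigmoid, where every unit sits at input 0 and the derivative takes its maximum value of 0.25, and it ignores the weight matrices entirely. The function names are illustrative only.

```python
import math

def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def best_case_gradient_factor(num_layers):
    """Upper bound on the factor contributed by the activations alone
    when a gradient passes back through `num_layers` sigmoid layers."""
    return sigmoid_grad(0.0) ** num_layers

for depth in (1, 5, 10, 20):
    factor = best_case_gradient_factor(depth)
    print(f"{depth:2d} layers -> factor <= {factor:.3e}")
```

Even in this best case, ten Sigmoid layers scale the gradient by at most \( 0.25^{10} \approx 9.5 \times 10^{-7} \); saturated units make the factor smaller still.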

### Effects on Training:
- **Slow Learning or Stagnation**: Since the gradients are very small in the early layers, the weights in those layers barely change during training. This makes it hard for the network to learn the features in the lower layers effectively.
- **Difficulty in Convergence**: The network might take a very long time to converge, or in some cases, it might not converge at all because of the minimal weight updates.
  
### Solutions:
1. **ReLU Activation Function**: ReLU (Rectified Linear Unit) and its variants help mitigate the vanishing gradient problem because their gradient is 1 for positive inputs, avoiding the issue of gradients becoming too small.
   
2. **Weight Initialization**: Proper weight initialization, such as **Xavier/Glorot initialization** for Sigmoid/Tanh networks and **He initialization** for ReLU networks, can help maintain healthy gradient flow in deep networks.
   
3. **Batch Normalization**: This normalizes the input to each layer, reducing the effect of vanishing gradients by stabilizing the learning process.

4. **Residual Networks (ResNets)**: Using skip connections or residual connections in deep networks allows the gradients to flow more easily through the layers, combating the vanishing gradient issue by providing direct paths for gradient flow.
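
Why skip connections help can be seen from a short derivation. This is a sketch for a single residual block, where \( F \) stands for the block's learned transformation and \( L \) for the loss (notation introduced here for illustration):

\[
y = x + F(x) \quad\Longrightarrow\quad \frac{\partial L}{\partial x} = \frac{\partial L}{\partial y}\left(I + \frac{\partial F}{\partial x}\right)
\]

The identity term \( I \) passes the upstream gradient through unchanged, so even when \( \partial F / \partial x \) is small, the gradient reaching the earlier layers does not shrink toward zero.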

The vanishing gradient problem is one of the key challenges in training deep networks but can be addressed with modern techniques like those mentioned above.

# 2.  Explain how Xavier initialization addresses the vanishing gradient problem.

Ans - **Xavier initialization** (also known as **Glorot initialization**) is a weight initialization method designed to address the vanishing gradient problem, especially when training deep neural networks with activation functions like **Sigmoid** and **Tanh**.

### What Is Xavier Initialization?

Xavier initialization aims to set the initial weights of a neural network in such a way that the variance of the outputs of each layer is similar to the variance of its inputs. This helps maintain an appropriate gradient flow throughout the network during backpropagation, mitigating the vanishing gradient problem.

The method works by setting the weights of a layer randomly, but with the following constraint:
- For a layer with **\( n_{\text{in}} \)** input units and **\( n_{\text{out}} \)** output units, the weights are initialized with values drawn from a **uniform** or **normal** distribution with mean 0 and variance given by:

\[
\text{Var}(W) = \frac{2}{n_{\text{in}} + n_{\text{out}}}
\]

This formula ensures that the variance of the output from each neuron remains roughly constant across layers, which helps prevent the signal from either exploding or vanishing as it propagates through the network.
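
The variance formula above can be sketched in plain NumPy. This is an illustrative implementation, not the code of any particular framework; the function names, layer sizes, and seed are assumptions made for the example.

```python
import numpy as np

def xavier_normal(n_in, n_out, rng):
    """Weights ~ N(0, 2 / (n_in + n_out))."""
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_in, n_out))

def xavier_uniform(n_in, n_out, rng):
    """Weights ~ U(-limit, limit). A uniform on [-a, a] has variance a^2 / 3,
    so limit = sqrt(6 / (n_in + n_out)) gives the same target variance."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

rng = np.random.default_rng(0)   # fixed seed, chosen arbitrarily
n_in, n_out = 512, 256           # hypothetical layer sizes
target = 2.0 / (n_in + n_out)

W_normal = xavier_normal(n_in, n_out, rng)
W_uniform = xavier_uniform(n_in, n_out, rng)

print("target variance :", target)
print("normal variant  :", W_normal.var())
print("uniform variant :", W_uniform.var())
```

Both variants should report an empirical variance close to the target; they differ only in the shape of the distribution.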

### How It Helps Prevent the Vanishing Gradient Problem:
1. **Balanced Signal Flow**: In a deep network, the signal (both activations and gradients) passes through many layers. If the weights are too small, the signals will shrink as they move through each layer, leading to vanishing gradients. Conversely, if the weights are too large, the gradients might become very large, causing the network to diverge (exploding gradients).
   
   Xavier initialization prevents both these issues by setting the weights such that the activations and gradients have a controlled variance. This helps maintain a stable gradient flow and prevents the gradients from becoming too small in the earlier layers.

2. **Optimizing Activation Function Behavior**: Activation functions like **Sigmoid** and **Tanh** have a limited range of outputs. For these functions, initializing weights too large pushes the inputs into the saturation region (where the gradient is near zero), which exacerbates the vanishing gradient problem, while initializing them too small shrinks the signal toward zero layer by layer. Xavier initialization ensures that the network's activations start with a good spread of values, avoiding saturation and allowing the gradients to propagate more effectively.

3. **Consistent Gradient Magnitudes**: By carefully setting the initial variance of the weights, Xavier initialization helps ensure that the magnitudes of the gradients remain similar across all layers. This way, the network can learn more efficiently, and weight updates during backpropagation are more balanced, allowing the training process to converge more quickly and effectively.
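
The three points above can be checked with a small experiment. The sketch below pushes random inputs through a stack of Tanh layers under three weight scales; the width, depth, batch size, seed, and the 0.99 saturation threshold are all arbitrary choices made for illustration.

```python
import numpy as np

def forward_tanh_stats(weight_std, width=256, depth=10, batch=512, seed=0):
    """Push random inputs through `depth` tanh layers whose weights are drawn
    from N(0, weight_std^2). Return the std of the final activations and the
    fraction of them that are saturated (|a| > 0.99)."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(batch, width))
    for _ in range(depth):
        W = rng.normal(0.0, weight_std, size=(width, width))
        a = np.tanh(a @ W)
    return a.std(), np.mean(np.abs(a) > 0.99)

width = 256
xavier_std = np.sqrt(2.0 / (width + width))
results = {
    "too small (std=0.01)": forward_tanh_stats(0.01, width),
    "xavier": forward_tanh_stats(xavier_std, width),
    "too large (std=1.0)": forward_tanh_stats(1.0, width),
}
for name, (act_std, saturated) in results.items():
    print(f"{name:22s} activation std={act_std:.2e}  saturated={saturated:.1%}")
```

With weights that are too small the activations collapse toward zero, with weights that are too large most units saturate, and with the Xavier scale the activations keep a usable spread without saturating.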

### In Summary:
Xavier initialization addresses the vanishing gradient problem by setting the initial weights in a way that maintains a stable variance across layers, helping to ensure that both activations and gradients flow properly through deep networks. This prevents the gradients from becoming too small or too large, facilitating more efficient and effective training.

# 3. What are some common activation functions that are prone to causing vanishing gradients?

Ans - The **vanishing gradient problem** is most commonly associated with activation functions that squash their output into a small range, particularly when the input values are large in magnitude (either very positive or very negative). The gradient of these functions becomes very small in such regions, which leads to small weight updates during backpropagation, especially in the earlier layers of the network (those closest to the input).

Some common activation functions that are prone to causing vanishing gradients are:

### 1. **Sigmoid Function**:
   - **Equation**: \( \sigma(x) = \frac{1}{1 + e^{-x}} \)
   - The Sigmoid function maps input values to a range between 0 and 1.
   - **Vanishing Gradient Issue**: When the input to the Sigmoid function is large in magnitude (strongly positive or strongly negative), the output approaches 1 or 0, and the gradient of the Sigmoid function becomes very small (close to zero). Even at its maximum, at \( x = 0 \), the derivative is only 0.25. This causes the gradients to vanish as they propagate backward through the network, making it difficult for the early layers to learn.
   
### 2. **Tanh Function (Hyperbolic Tangent)**:
   - **Equation**: \( \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \)
   - The Tanh function is similar to Sigmoid, but its output range is between -1 and 1, rather than 0 and 1.
   - **Vanishing Gradient Issue**: Like Sigmoid, Tanh suffers from vanishing gradients. When the input is either very positive or very negative, the output approaches 1 or -1, and the derivative (gradient) approaches 0. This leads to small gradients during backpropagation and hinders learning in deep networks.
   
### 3. **Softmax Function** (in certain cases):
   - **Equation**: \( \text{Softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}} \)
   - Softmax is typically used in the output layer for classification tasks, especially for multi-class classification.
   - **Vanishing Gradient Issue**: Softmax is not a hidden-layer activation and does not cause vanishing gradients in the way Sigmoid and Tanh do, but it also saturates: when one logit is much larger than the others, the output probabilities approach 0 or 1 and the derivative of Softmax with respect to its inputs approaches 0. When Softmax is paired with a loss such as mean squared error, this can make learning very slow even when the prediction is confidently wrong. Pairing it with cross-entropy loss largely avoids the problem, because the gradient with respect to the logits is then the difference between the predicted and target probabilities.

### 4. **Sigmoid-based or Tanh-based Activations in Deep Networks**:
   - In very deep networks, even if these activation functions are not causing the gradient to vanish in each individual layer, the cumulative effect of many small gradients passing through many layers can still lead to the vanishing gradient problem. This is why deep networks with Sigmoid or Tanh activations often struggle with effective training.

### Why Do These Functions Cause Vanishing Gradients?

Both Sigmoid and Tanh saturate for inputs of large magnitude (either positive or negative). When the function saturates:
- **Sigmoid**: For large positive or large negative inputs, the derivative becomes very close to 0.
- **Tanh**: Similarly, for large positive or large negative inputs, the derivative of Tanh approaches 0.

As gradients are propagated back through the network, these small derivatives result in gradients that shrink exponentially. In deep networks, the gradients can become so small that the earlier layers (closer to the input) barely learn anything, leading to slow or ineffective training.
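
The saturation argument follows directly from the derivatives of the two functions, written here in the same notation as the equations above:

\[
\sigma'(x) = \sigma(x)\,\bigl(1 - \sigma(x)\bigr) \le \frac{1}{4},
\qquad
\tanh'(x) = 1 - \tanh^2(x) \le 1
\]

Both derivatives tend to 0 as \( |x| \to \infty \), because \( \sigma(x) \) approaches 0 or 1 and \( \tanh^2(x) \) approaches 1. By the chain rule, the gradient reaching an early layer contains one such factor per layer, so in a network with \( n \) Sigmoid layers the activations alone contribute a factor of at most \( (1/4)^n \).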

### Solutions to Vanishing Gradients:
- **ReLU Activation**: ReLU (Rectified Linear Unit) helps mitigate the vanishing gradient problem, as its gradient is either 0 (for negative inputs) or 1 (for positive inputs). Gradients therefore pass through active units without being scaled down, although units that receive only negative inputs pass no gradient at all.
- **Leaky ReLU** and **Parametric ReLU**: Variants of ReLU that allow for small gradients when the input is negative, helping avoid the "dying ReLU" problem while still preventing vanishing gradients for positive inputs.
- **Swish Activation**: A newer activation function (proposed by Google) that can also help mitigate the vanishing gradient problem.
- **Proper Weight Initialization**: Using techniques like **Xavier/Glorot** or **He initialization** to ensure that the gradients remain well-behaved during the initial stages of training.

In summary, activation functions like **Sigmoid** and **Tanh** are prone to causing vanishing gradients, especially in deep networks. Switching to ReLU-based activations or employing proper weight initialization can help mitigate this problem.

# 4. Define the exploding gradient problem in deep neural networks. How does it impact training?

Ans :-  The **exploding gradient problem** occurs when the gradients of the loss function with respect to the model parameters (weights) become very large during the backpropagation process. This causes the model weights to update too drastically during training, which can lead to unstable training, and ultimately, the network may diverge and fail to converge to a meaningful solution.

### Causes of the Exploding Gradient Problem:
1. **Large Weight Values**: If the weights are initialized with very large values, the activations in each layer may become large, especially in deep networks. This leads to large gradients during backpropagation.
   
2. **Deep Networks**: In very deep neural networks, gradients can be propagated backward through many layers. If the gradients at each layer are large, they can accumulate and amplify exponentially as they move back through the network, resulting in gradients that become increasingly large.

3. **Certain Activation Functions**: Activation functions like **ReLU** and its variants can contribute to exploding gradients when the output becomes very large. This often happens when the network has large weight values, or when gradients are propagated through many layers.

4. **Poor Weight Initialization**: If the weights are initialized with values that are too large, this can cause large activations and consequently large gradients. Proper weight initialization techniques are crucial to avoiding this issue.

### Impact on Training:
- **Unstable Weight Updates**: When gradients are very large, the weight updates become excessively large. This can lead to drastic changes in the model parameters, causing the network to fail to converge and instead "explode" away from a useful solution.
  
- **Oscillating Loss**: The large weight updates may cause the loss function to fluctuate wildly, rather than steadily decreasing toward a minimum. This makes the training process highly unstable and prevents the model from learning effectively.

- **Numerical Instability**: Very large gradients can also lead to numerical overflow or instability, causing the model to "blow up" and result in **NaN (Not a Number)** values in the weights, gradients, or loss function.

### Example:
In deep networks, if a small change in the weights leads to a large change in the loss due to large gradients, the backpropagation process will adjust the weights so drastically that the optimization step overshoots the optimal point, resulting in very large fluctuations in the model's performance.
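
A small numerical experiment illustrates how quickly gradients can grow. The sketch below backpropagates a unit-norm gradient through a stack of purely linear layers; leaving out the activations is a deliberate simplification, and the width, depth, seed, and weight scales are arbitrary choices.

```python
import numpy as np

def backprop_norms(weight_scale, width=128, depth=30, seed=0):
    """Backpropagate a unit-norm gradient through `depth` linear layers whose
    weights are drawn from N(0, (weight_scale / sqrt(width))^2) and record the
    gradient norm after each layer. Activations are omitted on purpose so that
    only the effect of the weight scale is visible."""
    rng = np.random.default_rng(seed)
    grad = rng.normal(size=width)
    grad /= np.linalg.norm(grad)
    norms = []
    for _ in range(depth):
        W = rng.normal(0.0, weight_scale / np.sqrt(width), size=(width, width))
        grad = W.T @ grad          # chain rule for y = W x: dL/dx = W^T dL/dy
        norms.append(np.linalg.norm(grad))
    return norms

stable = backprop_norms(weight_scale=1.0)
exploding = backprop_norms(weight_scale=2.0)
print(f"scale 1.0: final gradient norm = {stable[-1]:.3e}")
print(f"scale 2.0: final gradient norm = {exploding[-1]:.3e}")
```

Doubling the weight scale doubles the gradient norm at every layer, so after 30 layers the gradient is larger by a factor of about \( 2^{30} \).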

### Solutions to the Exploding Gradient Problem:
1. **Gradient Clipping**: This technique involves setting a threshold value for gradients. If the gradients exceed this threshold, they are scaled down to avoid excessively large updates. This ensures that gradients stay within a reasonable range and prevents the weights from diverging.
   
2. **Proper Weight Initialization**: Using initialization techniques like **He initialization** (for ReLU-based networks) can help prevent the gradients from becoming too large by ensuring the weights start with an appropriate magnitude.
   
3. **Smaller Learning Rates**: Using smaller learning rates can help prevent large updates to the weights, giving the model a chance to converge gradually instead of overshooting the optimal solution.
   
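4. **Choosing Activation Functions with Care**: Unbounded activations such as ReLU place no upper limit on activation magnitudes, so they can contribute to exploding gradients when the weights are large. Variants like **Leaky ReLU** and **Swish** are also unbounded for positive inputs, so they are best combined with proper initialization or normalization rather than relied on to control gradient magnitude by themselves.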
   
5. **Batch Normalization**: Normalizing the activations of each layer using batch normalization can help regulate the flow of gradients, making them less likely to explode by keeping the activations within a reasonable range.
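
Gradient clipping, the first solution listed above, is simple enough to sketch directly. The version below clips by the global L2 norm across all parameter gradients; the function name and the example gradients are made up for illustration. Frameworks ship their own versions of this, such as `torch.nn.utils.clip_grad_norm_` in PyTorch and `tf.clip_by_global_norm` in TensorFlow.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so that their combined L2 norm is at
    most `max_norm`. Gradients already within the limit are returned unchanged.
    Returns the (possibly rescaled) gradients and the norm before clipping."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if global_norm <= max_norm:
        return list(grads), global_norm
    scale = max_norm / global_norm
    return [g * scale for g in grads], global_norm

# Hypothetical gradients for two parameter tensors.
grads = [np.array([3.0, 4.0]), np.array([[12.0]])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))

print("norm before:", norm_before)
print("norm after :", norm_after)
```

Because every gradient is multiplied by the same factor, clipping changes the size of the update but not its direction.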

### In Summary:
The exploding gradient problem occurs when gradients become excessively large, leading to unstable weight updates, numerical instability, and an inability to converge during training. It is more common in deep networks and can be mitigated by techniques such as gradient clipping, proper weight initialization, and using smaller learning rates.

# 5. What is the role of proper weight initialization in training deep neural networks?

Ans :-  **Proper weight initialization** plays a crucial role in training deep neural networks by ensuring that the network learns effectively and converges quickly. When training deep networks, the starting values of the weights have a significant impact on the gradients during backpropagation and how the network performs. Poor initialization can lead to issues such as vanishing or exploding gradients, slow convergence, or even failure to converge.

### Role of Proper Weight Initialization:

1. **Avoiding the Vanishing and Exploding Gradient Problems**:
   - **Vanishing Gradient**: If weights are initialized too small, the gradients during backpropagation can become too small, causing slow learning or stagnation (especially with activation functions like Sigmoid or Tanh).
   - **Exploding Gradient**: If weights are initialized too large, the gradients can grow exponentially as they are propagated backward through the layers, leading to excessively large weight updates that destabilize the training process.

   Proper initialization ensures that the gradients neither vanish nor explode, allowing for more stable and efficient learning, especially in deep networks.

2. **Balanced Gradient Flow**:
   - The primary objective of proper weight initialization is to maintain a balanced flow of gradients throughout the network. If the weights are initialized correctly, the gradients will neither vanish (become too small) nor explode (become too large) as they are propagated back through the layers. This helps prevent training instability and ensures that learning occurs across all layers of the network.

3. **Faster Convergence**:
   - Proper initialization can significantly speed up the convergence of the model during training. By starting with well-scaled weights, the optimization process (typically using gradient-based methods like SGD or Adam) can reach the optimal solution more quickly, reducing the number of training epochs required.

4. **Breaking Symmetry**:
   - If all weights are initialized with the same value (e.g., zero), the neurons in a layer will learn the same features during training. This means they will have identical gradients and update in the same direction, resulting in poor learning.
   - By initializing weights randomly (but with proper scaling), neurons are encouraged to learn different features, which helps the network learn more diverse and useful representations.
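
The symmetry argument can be verified on a tiny network. The sketch below uses a hypothetical one-hidden-layer model with a Tanh activation and a squared-error loss, and compares the gradient of the hidden weights under constant and random initialization; all sizes and values are illustrative.

```python
import numpy as np

def hidden_weight_gradient(W1, w2, x, y):
    """Gradient of L = 0.5 * (y_hat - y)^2 with respect to W1 for the tiny
    network y_hat = w2 . tanh(W1 x). Row i is the gradient for hidden unit i."""
    h = np.tanh(W1 @ x)                 # hidden activations, shape (n_hidden,)
    y_hat = w2 @ h                      # scalar prediction
    d_h = (y_hat - y) * w2              # dL/dh
    d_z = d_h * (1.0 - h ** 2)          # back through tanh
    return np.outer(d_z, x)             # dL/dW1, shape (n_hidden, n_in)

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4                   # hypothetical sizes
x, y = rng.normal(size=n_in), 1.0

# Constant initialization: every hidden unit starts identical.
grad_const = hidden_weight_gradient(np.full((n_hidden, n_in), 0.5),
                                    np.full(n_hidden, 0.5), x, y)
# Random initialization: hidden units start different.
grad_rand = hidden_weight_gradient(rng.normal(size=(n_hidden, n_in)),
                                   rng.normal(size=n_hidden), x, y)

print("constant init, rows identical:", np.allclose(grad_const, grad_const[0]))
print("random init,   rows identical:", np.allclose(grad_rand, grad_rand[0]))
```

With constant weights every hidden unit receives exactly the same gradient, so the units stay identical after each update; random initialization gives each unit its own gradient.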

### Common Weight Initialization Methods:
Different methods have been proposed to initialize the weights properly, depending on the activation function used and the depth of the network. Here are some of the most common techniques:

#### 1. **Zero Initialization (Not Recommended)**:
   - Initializing all weights to zero leads to symmetry problems, where every neuron in a layer learns the same features. This is why **zero initialization** is generally avoided in deep networks.

#### 2. **Random Initialization**:
   - This involves initializing weights randomly from a uniform or normal distribution, often with small values. However, random initialization alone may not address the vanishing or exploding gradient problem effectively.

#### 3. **Xavier/Glorot Initialization**:
   - **Equation**: \( \text{Var}(W) = \frac{2}{n_{\text{in}} + n_{\text{out}}} \)
   - Used for activation functions like **Sigmoid** and **Tanh**.
   - It scales the weights to maintain a balance between the variance of activations and gradients across layers. This helps prevent vanishing or exploding gradients, especially in deep networks.
   
#### 4. **He Initialization**:
   - **Equation**: \( \text{Var}(W) = \frac{2}{n_{\text{in}}} \)
   - Designed for **ReLU** and its variants, such as **Leaky ReLU**.
   - Since ReLU activations are not symmetric (they have a zero output for negative inputs), He initialization ensures that the weights are scaled to account for the fact that only half of the neurons will be active at any time, avoiding the vanishing gradient problem by maintaining a larger variance.
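
The factor of 2 in He initialization can be checked empirically. The sketch below pushes random inputs through a stack of ReLU layers with equal input and output width, once with the He variance and once with the Xavier variance; the width, depth, batch size, and seed are arbitrary choices for illustration.

```python
import numpy as np

def relu_forward_rms(init, width=512, depth=10, batch=256, seed=0):
    """Push random inputs through `depth` ReLU layers and return the
    root-mean-square of the final activations.
    init="he" uses Var(W) = 2 / n_in; init="xavier" uses 2 / (n_in + n_out)."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(batch, width))
    for _ in range(depth):
        if init == "he":
            std = np.sqrt(2.0 / width)
        elif init == "xavier":
            std = np.sqrt(2.0 / (width + width))
        else:
            raise ValueError(f"unknown init: {init!r}")
        W = rng.normal(0.0, std, size=(width, width))
        a = np.maximum(a @ W, 0.0)      # ReLU
    return np.sqrt(np.mean(a ** 2))

he_rms = relu_forward_rms("he")
xavier_rms = relu_forward_rms("xavier")
print(f"He init:     final activation RMS = {he_rms:.3f}")
print(f"Xavier init: final activation RMS = {xavier_rms:.3f}")
```

ReLU zeroes roughly half of a zero-centred signal and so halves its mean square. The He variance compensates for this exactly, while the Xavier variance loses a factor of 2 per layer in this equal-width setting, so the signal fades with depth.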

#### 5. **LeCun Initialization**:
   - **Equation**: \( \text{Var}(W) = \frac{1}{n_{\text{in}}} \)
   - Originally proposed for **Sigmoid**- and **tanh**-style activations, and the usual choice for **SELU** (self-normalizing networks). It scales the weights by the number of input units only.

#### 6. **Orthogonal Initialization**:
   - The weight matrix is initialized to be orthogonal, which can help maintain the stability of the gradient flow during training.
   - Works well for deep networks and recurrent neural networks (RNNs).

#### 7. **Unit Variance Initialization**:
   - For each layer, the weights are initialized so that the variance of the activations is close to 1. This is achieved through methods like **Xavier** or **He** initialization.

### Benefits of Proper Weight Initialization:
- **Improved Gradient Flow**: Ensures that the gradients neither vanish nor explode as they propagate through layers, leading to more stable training.
- **Faster Training**: Reduces the number of epochs required for the model to converge to an optimal solution.
- **Fewer Dead Neurons**: For activation functions like ReLU, proper initialization reduces the chance that neurons start out stuck in inactive regions (outputting zeros all the time).
- **Easier Optimization**: Provides a good starting point for optimization algorithms, improving their ability to find the optimal weights.

### Conclusion:
Proper weight initialization is critical for training deep neural networks effectively. It helps avoid issues like vanishing and exploding gradients, accelerates convergence, and ensures that the optimization process works smoothly. Choosing the right initialization method, based on the activation function and the depth of the network, can make a significant difference in the performance of deep learning models.


# 6.  Explain the concept of batch normalization and its impact on weight initialization techniques.

Ans :-  **Batch normalization (BN)** is a technique used in deep learning to improve the training speed, stability, and performance of neural networks. It addresses several issues, including the **internal covariate shift**, which occurs when the distribution of the activations changes during training as the model learns, leading to unstable and slow convergence. Batch normalization helps stabilize the learning process and allows the network to be trained faster, with a larger learning rate.

### Concept of Batch Normalization:

Batch normalization works by normalizing the activations of each layer in a mini-batch before passing them through the activation function. Specifically, for each mini-batch, the following steps are performed:

1. **Compute the Mean and Variance**: For each feature (neuron output) in the mini-batch, calculate the mean and variance across all the examples in the mini-batch.
   
2. **Normalize**: Subtract the mean and divide by the standard deviation (using the variance) to make the activations have a **mean of 0** and a **variance of 1**. This step normalizes the output for each feature.
   
3. **Scale and Shift**: To ensure that the network can still learn the best activation distribution, **learnable parameters** \( \gamma \) (scale) and \( \beta \) (shift) are introduced. These parameters allow the network to scale and shift the normalized activations to any desired range, giving the model the flexibility to learn optimal activations.

Mathematically, for a given activation \( x \) in a layer, batch normalization can be expressed as:

\[
\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}
\]

where:
- \( \mu \) is the mean of the mini-batch.
- \( \sigma^2 \) is the variance of the mini-batch.
- \( \epsilon \) is a small constant added for numerical stability.
- \( \hat{x} \) is the normalized output.
  
Then, the output of the batch normalization layer is:

\[
y = \gamma \hat{x} + \beta
\]

where:
- \( \gamma \) (scale) and \( \beta \) (shift) are learnable parameters.
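
The two equations above translate almost line for line into NumPy. This is a minimal training-mode sketch for a 2-D input; it leaves out the running averages that frameworks keep for use at inference time, and the input data is made up for illustration.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Training-mode batch normalization for a 2-D input of shape
    (batch, features). Statistics are computed per feature over the batch."""
    mu = x.mean(axis=0)                       # per-feature mean
    var = x.var(axis=0)                       # per-feature (biased) variance
    x_hat = (x - mu) / np.sqrt(var + eps)     # normalize
    return gamma * x_hat + beta               # scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))   # hypothetical activations
y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))

print("output mean per feature:", y.mean(axis=0).round(6))
print("output var per feature :", y.var(axis=0).round(4))
```

With \( \gamma = 1 \) and \( \beta = 0 \) each output feature has mean close to 0 and variance close to 1, regardless of the mean and scale of the input.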

### Impact of Batch Normalization on Weight Initialization:

Batch normalization can have a significant impact on the choice and effectiveness of **weight initialization** techniques. Here's how:

1. **Relaxed Weight Initialization Requirements**:
   - One of the key benefits of batch normalization is that it reduces the sensitivity of the network to weight initialization. Without batch normalization, proper weight initialization (such as Xavier or He initialization) is critical to prevent issues like vanishing or exploding gradients. However, with batch normalization, the activations are normalized during training, which stabilizes the gradients.
   - This means that **even with poor weight initialization**, batch normalization can help prevent the gradients from vanishing or exploding by normalizing the outputs at each layer. As a result, the network can learn effectively, even with suboptimal initialization.

2. **Faster Convergence with Larger Learning Rates**:
   - Batch normalization allows for **faster convergence** during training. This is partly because it makes the optimization process smoother, as the network's internal activations are normalized. With batch normalization, the model is less likely to get stuck in regions where gradients vanish or explode, making it possible to use **larger learning rates** without destabilizing the training process.
   - When using larger learning rates, networks with batch normalization can avoid oscillating or diverging during training, which might occur in networks without BN (especially if the weights are initialized poorly).

3. **Decreased Dependence on Specific Initialization Methods**:
   - With batch normalization, the dependence on specific weight initialization schemes is decreased. For example:
     - **Without batch normalization**, weight initialization methods like **He initialization** (for ReLU activations) or **Xavier initialization** (for Sigmoid/Tanh activations) are crucial for proper gradient flow.
     - **With batch normalization**, these concerns become less critical because BN normalizes the activations at each layer, ensuring that the gradient flow remains stable. However, it's still a good practice to use reasonable initializations to prevent issues from arising in the absence of BN.

4. **Improved Training Stability**:
   - Batch normalization stabilizes the learning process by reducing the issue of **internal covariate shift**, where the distribution of layer inputs changes during training as the model's parameters change. This stabilization reduces the need for careful tuning of the weight initialization, since BN ensures that the activations stay within a more predictable range throughout training.

5. **Reduction in the Need for Regularization**:
   - Because the mean and variance are computed on each mini-batch, batch normalization injects a small amount of noise into the activations during training. This gives it a slight regularizing effect, which can reduce (though not necessarily eliminate) the need for other regularizers such as **Dropout**.
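
The reduced sensitivity to initialization described in points 1 and 3 comes from the fact that batch normalization removes the overall scale of the pre-activations. The sketch below compares a layer with a reasonable weight scale against the same weights multiplied by 10; the shapes, seed, and scales are assumptions for the example, and \( \gamma = 1 \), \( \beta = 0 \) are used for simplicity.

```python
import numpy as np

def batch_norm(z, eps=1e-5):
    """Normalize each feature (column) over the batch; gamma=1, beta=0."""
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(128, 64))              # hypothetical batch of inputs
W = rng.normal(0.0, 0.1, size=(64, 32))     # a "reasonable" weight scale
W_big = 10.0 * W                            # the same weights, 10x too large

raw_ratio = (x @ W_big).std() / (x @ W).std()
max_diff = np.max(np.abs(batch_norm(x @ W_big) - batch_norm(x @ W)))

print(f"pre-activation std ratio without BN: {raw_ratio:.1f}")
print(f"largest difference after BN:         {max_diff:.2e}")
```

Without normalization the pre-activations are ten times larger; after batch normalization the two layers produce almost identical outputs. This only covers the scale of the weights: it does not fix other initialization problems such as the symmetry caused by constant weights.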

### Summary of Key Points:
- **Batch normalization** normalizes the activations during training to have a mean of 0 and a variance of 1, improving training stability, speed, and performance.
- It reduces the sensitivity of the model to the **weight initialization** scheme, meaning that less strict initialization methods can be used effectively with BN.
- BN allows for **larger learning rates** and faster convergence, helping the network train more efficiently.
- While weight initialization still matters, batch normalization reduces the dependence on specific initialization schemes (like Xavier or He), making training more robust and less prone to problems like vanishing/exploding gradients.
  
In summary, **batch normalization** helps stabilize the training process and mitigates issues caused by improper weight initialization, allowing for faster convergence and reducing the need for fine-tuned initialization methods. However, it's still beneficial to use good initialization techniques (such as He or Xavier) in conjunction with batch normalization for the best results.

# 7. Implement He initialization in Python using TensorFlow or PyTorch.

Ans :-  Here's how you can implement **He Initialization** in Python using both **TensorFlow** and **PyTorch**:

### 1. **He Initialization in TensorFlow**:

In TensorFlow, you can use the `tf.keras.initializers.HeNormal()` or `tf.keras.initializers.HeUniform()` for He initialization, depending on whether you want the weights initialized using a normal distribution or a uniform distribution.

#### Example using `HeNormal` (Normal Distribution):

```python
import tensorflow as tf

# Define a simple neural network model using He initialization
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', kernel_initializer=tf.keras.initializers.HeNormal(), input_shape=(784,)),
    tf.keras.layers.Dense(64, activation='relu', kernel_initializer=tf.keras.initializers.HeNormal()),
    tf.keras.layers.Dense(10, activation='softmax')
])

# Print the model summary
model.summary()
```

#### Example using `HeUniform` (Uniform Distribution):

```python
import tensorflow as tf

# Define a simple neural network model using He initialization
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', kernel_initializer=tf.keras.initializers.HeUniform(), input_shape=(784,)),
    tf.keras.layers.Dense(64, activation='relu', kernel_initializer=tf.keras.initializers.HeUniform()),
    tf.keras.layers.Dense(10, activation='softmax')
])

# Print the model summary
model.summary()
```

### 2. **He Initialization in PyTorch**:

In PyTorch, He initialization can be applied to layers by using `torch.nn.init.kaiming_normal_()` or `torch.nn.init.kaiming_uniform_()` for normal and uniform distributions, respectively.

#### Example using `kaiming_normal_` (Normal Distribution):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(784, 128)  # First fully connected layer
        self.fc2 = nn.Linear(128, 64)   # Second fully connected layer
        self.fc3 = nn.Linear(64, 10)    # Output layer

        # Apply He initialization (normal distribution) to the weights.
        # mode='fan_in' scales by the number of inputs: std = sqrt(2 / n_in).
        torch.nn.init.kaiming_normal_(self.fc1.weight, mode='fan_in', nonlinearity='relu')
        torch.nn.init.kaiming_normal_(self.fc2.weight, mode='fan_in', nonlinearity='relu')
        torch.nn.init.kaiming_normal_(self.fc3.weight, mode='fan_in', nonlinearity='relu')

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

# Create the model
model = SimpleNN()

# Print the model architecture
print(model)
```

#### Example using `kaiming_uniform_` (Uniform Distribution):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(784, 128)  # First fully connected layer
        self.fc2 = nn.Linear(128, 64)   # Second fully connected layer
        self.fc3 = nn.Linear(64, 10)    # Output layer

        # Apply He initialization (uniform distribution) to the weights.
        # mode='fan_in' scales by the number of inputs: bound = sqrt(6 / n_in).
        torch.nn.init.kaiming_uniform_(self.fc1.weight, mode='fan_in', nonlinearity='relu')
        torch.nn.init.kaiming_uniform_(self.fc2.weight, mode='fan_in', nonlinearity='relu')
        torch.nn.init.kaiming_uniform_(self.fc3.weight, mode='fan_in', nonlinearity='relu')

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

# Create the model
model = SimpleNN()

# Print the model architecture
print(model)
```

### Explanation:
- **TensorFlow**: The `HeNormal()` initializer draws weights from a truncated normal distribution centered on 0 with a standard deviation of \( \sqrt{\frac{2}{n_{\text{in}}}} \), where \( n_{\text{in}} \) is the number of input units to the layer. `HeUniform()` draws from a uniform distribution on \( [-b, b] \) with \( b = \sqrt{\frac{6}{n_{\text{in}}}} \).
  
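- **PyTorch**: `torch.nn.init.kaiming_normal_()` applies He initialization from a normal distribution. With `mode='fan_in'` (the default) and `nonlinearity='relu'`, the standard deviation is \( \sqrt{\frac{2}{n_{\text{in}}}} \); with `mode='fan_out'` it is \( \sqrt{\frac{2}{n_{\text{out}}}} \) instead, which preserves the variance in the backward pass rather than the forward pass. `kaiming_uniform_()` draws from a uniform distribution on \( [-b, b] \) with \( b = \sqrt{\frac{6}{n_{\text{in}}}} \), which has the same variance.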

Both TensorFlow and PyTorch allow for easy customization of the initialization process, and the use of He initialization can help prevent issues like vanishing gradients, especially for ReLU-based networks.
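
For reference, the same scheme can be written from scratch in NumPy, which makes the formulas explicit. This is an illustrative sketch rather than the code of either framework: the function names are made up, and the normal variant uses a plain (not truncated) normal distribution.

```python
import numpy as np

def he_normal(n_in, n_out, rng):
    """Weights ~ N(0, 2 / n_in), stored with shape (n_in, n_out)."""
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

def he_uniform(n_in, n_out, rng):
    """Weights ~ U(-limit, limit) with limit = sqrt(6 / n_in), which has the
    same variance 2 / n_in because Var(U(-a, a)) = a^2 / 3."""
    limit = np.sqrt(6.0 / n_in)
    return rng.uniform(-limit, limit, size=(n_in, n_out))

rng = np.random.default_rng(0)         # fixed seed, chosen arbitrarily
n_in, n_out = 784, 128                 # sizes of the first layer in the examples
target_std = np.sqrt(2.0 / n_in)

W_n = he_normal(n_in, n_out, rng)
W_u = he_uniform(n_in, n_out, rng)

print("target std :", target_std)
print("he_normal  :", W_n.std())
print("he_uniform :", W_u.std())
```

Both empirical standard deviations should land close to \( \sqrt{2 / 784} \), matching what the built-in initializers aim for with `fan_in` scaling.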