<div class="alert alert-block alert-warning">

The smoothness of GELU can lead to better optimization properties
during training, as it allows for more nuanced adjustments to the model's parameters.

In contrast, ReLU has a sharp corner at zero, which can sometimes make optimization harder,
especially in networks that are very deep or have complex architectures. 

Moreover, unlike ReLU, which outputs zero for any negative input, GELU allows for a small, non-zero output
for negative values.

This characteristic means that during the training process, neurons that
receive negative input can still contribute to the learning process, albeit to a lesser extent
than positive inputs.


![image.png](attachment:129403d1-b336-4239-a8ea-0bc112e1e9fd.png)
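
To make the comparison concrete, here is a minimal sketch that evaluates both activations and a numerical estimate of their slopes at a few points. It assumes the exact GELU definition, x · Φ(x), computed with `math.erf`; the notebook's own `GELU` class may use the tanh approximation instead, which gives very similar values. The helper names `relu`, `gelu`, and `slope` are illustrative only.

```python
import math


def relu(x):
    # ReLU: zero for every negative input
    return max(0.0, x)


def gelu(x):
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def slope(fn, x, h=1e-5):
    # Central-difference estimate of the derivative of fn at x
    return (fn(x + h) - fn(x - h)) / (2.0 * h)


for x in (-3.0, -1.0, -0.5, 0.5, 1.0, 3.0):
    print(
        f"x={x:+.1f}  relu={relu(x):+.4f}  gelu={gelu(x):+.4f}  "
        f"relu'={slope(relu, x):+.4f}  gelu'={slope(gelu, x):+.4f}"
    )
```

For negative inputs, the ReLU columns are exactly zero, while GELU returns a small negative value (about -0.159 at x = -1) with a non-zero slope. That non-zero slope is what lets a neuron with negative input still receive a gradient during training.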

In [2]:
import torch.nn as nn

# Configuration for the 124M-parameter GPT model
GPT_CONFIG_124M = {
    "vocab_size": 50257,    # Vocabulary size
    "context_length": 1024, # Context length
    "emb_dim": 768,         # Embedding dimension
    "n_heads": 12,          # Number of attention heads
    "n_layers": 12,         # Number of layers
    "drop_rate": 0.1,       # Dropout rate
    "qkv_bias": False       # Query-Key-Value bias
}


class FeedForward(nn.Module):
    """Feed-forward block: expand to 4 * emb_dim, apply GELU, project back to emb_dim."""

    def __init__(self, cfg):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),  # expansion
            GELU(),  # not imported here; assumed to be defined in an earlier cell
            nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),  # contraction
        )

    def forward(self, x):
        # The output has the same shape as the input: (..., emb_dim)
        return self.layers(x)
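
As a quick sanity check of the expansion and contraction described in the comments, the following sketch passes a random batch through the block and inspects the shapes. It is self-contained, so it redefines `FeedForward` and, as an assumption, uses PyTorch's built-in `nn.GELU()` as a stand-in for the notebook's own `GELU` class. The batch size of 2 and sequence length of 3 are arbitrary illustrative values.

```python
import torch
import torch.nn as nn

cfg = {"emb_dim": 768}  # the only config key FeedForward reads


class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),  # expansion
            nn.GELU(),  # assumption: stand-in for the notebook's GELU class
            nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),  # contraction
        )

    def forward(self, x):
        return self.layers(x)


torch.manual_seed(123)
ffn = FeedForward(cfg)

x = torch.rand(2, 3, cfg["emb_dim"])  # (batch, tokens, emb_dim)
hidden = ffn.layers[0](x)             # output of the expansion layer only
out = ffn(x)                          # output of the full block
n_params = sum(p.numel() for p in ffn.parameters())

print("input: ", tuple(x.shape))
print("hidden:", tuple(hidden.shape))
print("output:", tuple(out.shape))
print("params:", n_params)
```

The hidden activation has a last dimension of 4 × 768 = 3072, and the output returns to 768. Because input and output shapes match, the block can be stacked or wrapped in a residual connection without any reshaping.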

### Importance of skip connections

![image.png](attachment:193bad0b-902c-472c-be05-ebc1fbdb97b7.png)


![image.png](attachment:b4b50582-9987-416d-a6ac-c5cd47d0aa29.png)