### GELU Activation

![GELU Activation Function](gelu_activation.png)

#### Why GELU?

GELU (Gaussian Error Linear Unit) is an activation function used in neural networks.
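
For reference, the standard definition can be written out as follows, where $\Phi$ is the standard normal cumulative distribution function. The second line is the commonly used tanh approximation. As far as I know, `torch.nn.GELU()` uses the exact form by default and switches to the approximation with `approximate="tanh"`.

$$
\begin{aligned}
\mathrm{GELU}(x) &= x\,\Phi(x) = \frac{x}{2}\left[1 + \operatorname{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right] \\
\mathrm{GELU}(x) &\approx \frac{x}{2}\left[1 + \tanh\!\left(\sqrt{\frac{2}{\pi}}\,\bigl(x + 0.044715\,x^{3}\bigr)\right)\right]
\end{aligned}
$$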

- GELU is smooth and differentiable everywhere, unlike ReLU, which is not differentiable at 0.
- It mitigates the dead neuron problem. ReLU outputs zero with zero gradient for every negative input, so a neuron stuck in that region stops learning, whereas GELU still passes a small gradient for most negative inputs.
- Empirically, it tends to work better than ReLU in LLM experiments.
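
The gradient behaviour described in the list above can be checked numerically. The following is a minimal sketch in plain Python, using `math.erf` for the exact GELU rather than PyTorch. The helper names `gelu`, `gelu_grad`, `relu`, and `relu_grad` are illustrative only and do not come from the notebook.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_grad(x: float) -> float:
    """Derivative of exact GELU: Phi(x) + x * phi(x), phi being the normal PDF."""
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return cdf + x * pdf


def relu(x: float) -> float:
    return max(0.0, x)


def relu_grad(x: float) -> float:
    # Convention: gradient is taken as 0 at x == 0, where ReLU is not differentiable.
    return 1.0 if x > 0 else 0.0


# Compare values and gradients on both sides of zero.
for x in (-2.0, -1.0, -0.5, 0.5, 1.0, 2.0):
    print(
        f"x={x:+.1f}  relu={relu(x):.4f}  relu'={relu_grad(x):.4f}  "
        f"gelu={gelu(x):+.4f}  gelu'={gelu_grad(x):+.4f}"
    )
```

For the negative inputs in this loop, the ReLU gradient is exactly zero while the GELU gradient is small but nonzero, which is the property the second bullet relies on.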

![GELU vs ReLU](gelu_vs_relu.png)


In [1]:
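GPT2_CONFIG_124M = {
    "vocab_size": 50257,     # Vocabulary size
    "context_length": 1024,  # Maximum number of tokens per sequence
    "emb_dim": 768,          # Embedding dimension
    "n_heads": 12,           # Number of attention heads
    "n_layers": 12,          # Number of transformer blocks
    "drop_rate": 0.1,        # Dropout rate
    "qkv_bias": False        # Whether the query/key/value projections use a bias
}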


In [None]:
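import torch


class FeedForward(torch.nn.Module):
    """Feed-forward block: expand to 4 * emb_dim, apply GELU, project back."""

    def __init__(self, config):
        super().__init__()
        emb_dim = config["emb_dim"]
        self.layers = torch.nn.Sequential(
            torch.nn.Linear(emb_dim, 4 * emb_dim),  # Expansion: emb_dim -> 4 * emb_dim
            torch.nn.GELU(),                        # Activation
            torch.nn.Linear(4 * emb_dim, emb_dim),  # Compression: 4 * emb_dim -> emb_dim
        )

    def forward(self, x):
        # The linear layers act on the last dimension, so the output shape matches the input shape.
        return self.layers(x)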
      

In [4]:
ffn = FeedForward(GPT2_CONFIG_124M)

x = torch.randn(2, 3, GPT2_CONFIG_124M["emb_dim"])

out = ffn(x)

print(out.shape)


torch.Size([2, 3, 768])
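
The output shape matches the input, but the block is still large because of the 4x expansion. As a rough sketch, the parameter count of the two linear layers can be worked out by hand. `ffn_param_count` is a hypothetical helper written for this illustration, and it assumes both layers have bias terms, which is the default for `torch.nn.Linear`.

```python
def ffn_param_count(emb_dim: int, expansion: int = 4) -> int:
    """Count weights and biases of an expand -> activation -> compress block."""
    hidden = expansion * emb_dim
    expand = emb_dim * hidden + hidden      # weight matrix + bias of the first Linear
    compress = hidden * emb_dim + emb_dim   # weight matrix + bias of the second Linear
    return expand + compress                # GELU itself has no parameters


print(ffn_param_count(768))  # 4722432
```

In the notebook, this figure could be compared against `sum(p.numel() for p in ffn.parameters())`.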
