## Implementing Llama2 3B


In [3]:
import torch
import torch.nn as nn

### **Normalization**

One of the main divergences from the original GPT transformer architecture is the normalization technique. Normalization rescales the activations flowing into each layer so that they stay in a consistent range, which helps stabilize training and speeds up convergence.

Traditional **LayerNorm** normalizes each input using the mean and variance computed across the feature dimension. It has three parameters: $\epsilon$, a small constant that keeps the denominator away from zero; $\gamma$, a learnable scale; and $\beta$, a learnable shift.
$$y=\frac{x-E[x]}{\sqrt{Var(x) + \epsilon}} \cdot \gamma + \beta$$
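
As a quick sanity check of the LayerNorm formula, the sketch below computes it by hand and compares the result with `nn.LayerNorm`. The input shape `(4, 8)`, the seed, and the variable names are assumptions chosen for illustration; `gamma` and `beta` are set to the same initial values PyTorch uses (ones and zeros).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.rand(4, 8)  # (batch, features), illustrative shape
eps = 1e-5

# Mean and biased variance over the feature dimension
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)

# gamma (scale) and beta (shift) at their initial values
gamma = torch.ones(8)
beta = torch.zeros(8)
y_manual = (x - mean) / torch.sqrt(var + eps) * gamma + beta

# Built-in LayerNorm with the same epsilon
layer_norm = nn.LayerNorm(8, eps=eps)
y_builtin = layer_norm(x)

layernorm_matches = torch.allclose(y_manual, y_builtin, atol=1e-5)
print(layernorm_matches)
```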

In the Llama 2 architecture, **RMSNorm** (root mean square normalization) is used instead. RMSNorm skips the mean subtraction and is therefore cheaper to compute than LayerNorm, while the loss in model quality is negligible in practice. Notably, the shift parameter $\beta$ is not used in RMSNorm.

$$y_i=\frac{x_i}{\sqrt{\epsilon +\frac{1}{n}\sum_{j=1}^{n}{x_j^2}}} \cdot \gamma_i$$


In [4]:
class RMSNorm(nn.Module):
    def __init__(self, embd_dim:int, eps=1e-5):
        super().__init__()
        self.eps = eps # epsilon
        self.embd_dim = embd_dim
        self.weight = nn.Parameter(torch.ones(embd_dim))  # gamma, the learnable scale
    
    def forward(self, x:torch.Tensor):
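        # Mean of the squared values over the feature dimension
        mean_sq = x.pow(2).mean(dim=-1, keepdim=True)
        # x / RMS(x), using the reciprocal square root
        x_norm = x * torch.rsqrt(mean_sq + self.eps)
        # Scale by gamma and cast back to the input dtype
        return (x_norm * self.weight).to(dtype=x.dtype)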

##### Forward Pass Step Through


In [5]:
inputs = torch.rand((4, 8))
rmsn = RMSNorm(embd_dim=8)
inputs

tensor([[0.4365, 0.5728, 0.3160, 0.7362, 0.0550, 0.2335, 0.0010, 0.3170],
        [0.2950, 0.1941, 0.4875, 0.4818, 0.1934, 0.6766, 0.4779, 0.0472],
        [0.0565, 0.3778, 0.6870, 0.1934, 0.3055, 0.6714, 0.5032, 0.8174],
        [0.4360, 0.7093, 0.9083, 0.5762, 0.0884, 0.0227, 0.2693, 0.3611]])

In [6]:
# Current unnormalized sum
inputs.sum(dim=-1)

tensor([2.6681, 2.8536, 3.6122, 3.3713])

In [9]:
# Take the mean of the squared values along the feature dimension
inputs_mean = inputs.pow(2).mean(dim=1, keepdim=True)
inputs_mean

tensor([[0.1648],
        [0.1651],
        [0.2651],
        [0.2577]])

In [11]:
# Add epsilon and take sqrt
torch.sqrt(inputs_mean + rmsn.eps)

tensor([[0.4060],
        [0.4063],
        [0.5149],
        [0.5076]])

In [12]:
# Since we need x / RMS(x), take the reciprocal square root instead
torch.rsqrt(inputs_mean + rmsn.eps)

tensor([[2.4630],
        [2.4613],
        [1.9422],
        [1.9699]])

In [15]:
# Now multiply by the numerator x (inputs in this case)
inputs_norm = inputs * torch.rsqrt(inputs_mean + rmsn.eps)
inputs_norm

tensor([[1.0752, 1.4109, 0.7782, 1.8134, 0.1354, 0.5751, 0.0025, 0.7809],
        [0.7261, 0.4779, 1.2000, 1.1860, 0.4759, 1.6655, 1.1763, 0.1161],
        [0.1097, 0.7339, 1.3342, 0.3756, 0.5934, 1.3039, 0.9774, 1.5875],
        [0.8589, 1.3973, 1.7893, 1.1350, 0.1741, 0.0447, 0.5304, 0.7114]])

In [17]:
# Finally, multiply by gamma_i, which are the learnable weights
inputs_norm * rmsn.weight

tensor([[1.0752, 1.4109, 0.7782, 1.8134, 0.1354, 0.5751, 0.0025, 0.7809],
        [0.7261, 0.4779, 1.2000, 1.1860, 0.4759, 1.6655, 1.1763, 0.1161],
        [0.1097, 0.7339, 1.3342, 0.3756, 0.5934, 1.3039, 0.9774, 1.5875],
        [0.8589, 1.3973, 1.7893, 1.1350, 0.1741, 0.0447, 0.5304, 0.7114]],
       grad_fn=<MulBackward0>)

In [18]:
rms = RMSNorm(inputs.size(-1))
rms(inputs)

tensor([[1.0752, 1.4109, 0.7782, 1.8134, 0.1354, 0.5751, 0.0025, 0.7809],
        [0.7261, 0.4779, 1.2000, 1.1860, 0.4759, 1.6655, 1.1763, 0.1161],
        [0.1097, 0.7339, 1.3342, 0.3756, 0.5934, 1.3039, 0.9774, 1.5875],
        [0.8589, 1.3973, 1.7893, 1.1350, 0.1741, 0.0447, 0.5304, 0.7114]],
       grad_fn=<MulBackward0>)
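
One way to check the step-through above is to confirm that every normalized row ends up with a root mean square close to 1, and that rescaling the input barely changes the output. The sketch below uses a hypothetical standalone helper `rms_norm` (not part of the notebook's `RMSNorm` class) with the same formula, plus an assumed seed and input shape.

```python
import torch

def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Functional RMSNorm: x / sqrt(mean(x^2) + eps) * weight."""
    mean_sq = x.pow(2).mean(dim=-1, keepdim=True)
    return x * torch.rsqrt(mean_sq + eps) * weight

torch.manual_seed(0)
x = torch.rand(4, 8)
weight = torch.ones(8)  # gamma at its initial value

y = rms_norm(x, weight)

# RMS of each normalized row; should be very close to 1
row_rms = y.pow(2).mean(dim=-1).sqrt()
print(row_rms)

# Scaling the input by a constant leaves the output almost unchanged
y_scaled = rms_norm(10 * x, weight)
scale_invariant = torch.allclose(y, y_scaled, atol=1e-3)
print(scale_invariant)
```

The small deviation from exactly 1 comes from $\epsilon$ in the denominator.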

### **Activation Changes**

Activation functions are non-linear functions that connect the linear layers in a neural network. Without them, a neural network could only learn linear relationships. For example, if two linear layers are applied back to back with no activation function in between, their weight matrices multiply into a single matrix, so the two layers collapse into one equivalent linear layer.
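
The collapse of stacked linear layers can be checked numerically. The sketch below uses two random weight matrices `W1` and `W2` (illustrative names and shapes, biases omitted) and shows that applying them in sequence equals applying their product once, while inserting a ReLU between them breaks that equivalence.

```python
import torch

torch.manual_seed(0)
# float64 keeps rounding error negligible for the comparison
x = torch.randn(4, 8, dtype=torch.float64)
W1 = torch.randn(8, 16, dtype=torch.float64)
W2 = torch.randn(16, 8, dtype=torch.float64)

# Two linear layers with no activation in between
two_layers = (x @ W1) @ W2
# A single linear layer whose weight is the product W1 @ W2
one_layer = x @ (W1 @ W2)
collapses = torch.allclose(two_layers, one_layer)

# With a non-linearity in between, the layers no longer collapse
with_relu = torch.relu(x @ W1) @ W2
relu_differs = not torch.allclose(with_relu, one_layer)

print(collapses, relu_differs)
```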

There are a variety of activation functions. A simple one is ReLU, which clips any negative number to 0 and leaves positive numbers unchanged. It became popular because it helps mitigate the _vanishing gradients_ problem in deep networks. More recent activation functions such as GELU and SiLU are smooth approximations of ReLU and tend to perform better in practice. Their smoothness allows for more nuanced learning, since negative inputs are not immediately cut off to 0 as they are with ReLU.
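
To make the difference concrete, the sketch below evaluates ReLU, GELU, and SiLU on a few hand-picked sample values and compares the ReLU and SiLU gradients at $x=-1$. The sample inputs are arbitrary choices for illustration.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-3.0, -1.0, -0.1, 0.0, 1.0, 3.0])

relu_out = F.relu(x)  # negatives are clipped to exactly 0
gelu_out = F.gelu(x)  # negatives map to small negative values
silu_out = F.silu(x)  # negatives map to small negative values

print(relu_out)
print(gelu_out)
print(silu_out)

# Gradient at a negative input: zero for ReLU, non-zero for SiLU
x_neg = torch.tensor(-1.0, requires_grad=True)
(relu_grad,) = torch.autograd.grad(F.relu(x_neg), x_neg)
(silu_grad,) = torch.autograd.grad(F.silu(x_neg), x_neg)
print(relu_grad, silu_grad)
```

Because the SiLU gradient is non-zero for negative inputs, those units can still receive a learning signal, whereas ReLU passes none.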

Llama 2 uses **SiLU**, the sigmoid linear unit (also called the sigmoid-weighted linear unit).

$$\mathrm{SiLU}(x) = x \cdot \sigma(x)$$

In PyTorch, SiLU is available as a built-in module:


In [19]:
nn.SiLU()

SiLU()
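
As a minimal check, the formula $x \cdot \sigma(x)$ can be written out by hand and compared with the built-in module. The input range below is an arbitrary choice for illustration.

```python
import torch
import torch.nn as nn

x = torch.linspace(-4, 4, steps=9)

# SiLU written out from its definition: x * sigmoid(x)
silu_manual = x * torch.sigmoid(x)

# The built-in module applied to the same input
silu_module = nn.SiLU()(x)

silu_matches = torch.allclose(silu_manual, silu_module, atol=1e-6)
print(silu_matches)
```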

### **Feed Forward Layer**
