# Overview

Llama2, like the original Llama model, is based on the Google transformer architecture, with improvements. Llama's improvements include

* RMSNorm pre-normalization, inspired by GPT-3
* SwiGLU activation function, inspired by Google's PaLM
* [Multi-query attention](https://arxiv.org/abs/1911.02150) instead of multi-head attention
* Rotary positional embeddings (RoPE), inspired by GPT Neo

Llama training used the [AdamW](https://arxiv.org/abs/1711.05101) optimizer. Llama2's primary differences from Llama are an increased context length (4096 vs. 2048 tokens) and [grouped-query attention (GQA)](https://arxiv.org/abs/2305.13245) instead of [multi-query attention (MQA)](https://arxiv.org/abs/1911.02150) in the two larger models.
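
The attention variants named above differ in how many key/value heads the query heads share. The sketch below is only an illustration of that idea: the helper `kv_head_for` and the head counts are hypothetical, and it assumes consecutive query heads are grouped onto the same key/value head.

```python
def kv_head_for(query_head: int, n_heads: int, n_kv_heads: int) -> int:
    """Return the index of the key/value head read by a query head.

    Assumes n_heads is a multiple of n_kv_heads and that consecutive
    query heads form one group (an illustrative layout, not Llama code).
    """
    if n_heads % n_kv_heads != 0:
        raise ValueError("n_heads must be divisible by n_kv_heads")
    group_size = n_heads // n_kv_heads  # query heads per key/value head
    return query_head // group_size


n_heads = 8  # toy value, not a real Llama configuration

# Multi-head attention: every query head has its own key/value head.
mha = [kv_head_for(h, n_heads, n_kv_heads=8) for h in range(n_heads)]
# Multi-query attention: all query heads share a single key/value head.
mqa = [kv_head_for(h, n_heads, n_kv_heads=1) for h in range(n_heads)]
# Grouped-query attention: query heads share key/value heads in groups.
gqa = [kv_head_for(h, n_heads, n_kv_heads=2) for h in range(n_heads)]

print(mha)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(mqa)  # [0, 0, 0, 0, 0, 0, 0, 0]
print(gqa)  # [0, 0, 0, 0, 1, 1, 1, 1]
```

Fewer key/value heads mean a smaller key/value cache at inference time. Grouped-query attention sits between the two extremes.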


# Architecture 

> The components that we need to implement

According to the paper, LLaMA is based on the transformer architecture, with the following changes:

* RMSNorm
* SwiGLU
* RoPE
* Multi-Query-Attention

In [1]:
%%capture
!pip install transformers==4.37.2
!pip install datasets==2.17.0
!pip install sentencepiece==0.1.99

# Root Mean Square Layer Normalization (RMSNorm)

LLaMA2 normalizes the input of each transformer sub-layer instead of the output. RMSNorm is a simplified variant of Layer Normalization (LayerNorm). The motivation is LayerNorm's computational overhead, which makes training slower and more expensive. RMSNorm achieves performance comparable to LayerNorm while reducing the running time. LayerNorm has two properties.

**Re-centring**

It makes the model insensitive to shift noise on both inputs and weights.

**Re-scaling**

It keeps the output representations intact when both inputs and weights are randomly scaled. The RMSNorm paper argues that most of the benefit comes from re-scaling.
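
The two properties can be checked numerically. The sketch below uses plain Python and a LayerNorm without gain or bias. It models shift and scale noise directly as a constant added to, or multiplied into, the summed inputs, which is a simplifying assumption. The values are toy numbers chosen for illustration.

```python
import math


def layer_norm(a):
    """LayerNorm without gain or bias: subtract the mean, divide by the std."""
    n = len(a)
    mean = sum(a) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in a) / n)
    return [(v - mean) / std for v in a]


def close(u, v):
    """Element-wise comparison with a small tolerance for rounding."""
    return all(math.isclose(x, y, abs_tol=1e-9) for x, y in zip(u, v))


a = [1.0, 2.0, 4.0, 7.0]           # toy summed inputs
shifted = [v + 10.0 for v in a]    # same constant added to every element
scaled = [v * 3.0 for v in a]      # every element multiplied by a positive factor

# Re-centering: the shift is removed by the mean subtraction.
shift_invariant = close(layer_norm(a), layer_norm(shifted))
# Re-scaling: the scale is removed by the division by the standard deviation.
scale_invariant = close(layer_norm(a), layer_norm(scaled))

print(shift_invariant, scale_invariant)  # True True
```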

RMSNorm keeps only the re-scaling invariance and regularizes the summed inputs according to the root mean square (RMS) statistic alone. It simplifies LayerNorm by removing the mean statistic entirely, at the cost of the invariance that mean normalization affords. When the mean of the summed inputs is zero, RMSNorm is exactly equal to LayerNorm. RMSNorm does not re-center the summed inputs, but the RMSNorm paper reports that this property is not essential to LayerNorm's success.

Just like Layer Normalization, we also have a learnable parameter `gamma` $g$ that is multiplied by the normalized values.

$$\bar{a}_i=\frac{a_i}{\mathrm{RMS}(a)}\,g_i, \quad \text{where } \mathrm{RMS}(a)=\sqrt{\frac{1}{n}\sum_{i=1}^{n}a_i^2}$$

The `RMSnorm` module below first normalizes the input `x` by dividing it by its root mean square, which makes the result invariant to rescaling of the input. The learned parameter `self.weight` is then applied element-wise to the normalized tensor, adjusting the magnitude of each feature by its learned scaling factor.

In [3]:
import torch
import torch.nn as nn

class RMSnorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps=eps
        # the gamma parameter
        self.weight=nn.Parameter(torch.ones(dim))
    
    def _norm(self, x:torch.Tensor):
        # (B, seq_len, dim)*(B, seq_len,1)=(B, seq_len, dim)
        # rsqrt: 1/sqrt(x)
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True)+self.eps)
    
    def forward(self, x:torch.Tensor):
        # weight is a gain parameter used to re-scale the standardized summed inputs
        return self.weight*self._norm(x.float()).type_as(x)
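
To check the formula by hand, here is a minimal plain-Python sketch of the same computation, independent of the PyTorch module. The helper names `rms_norm`, `layer_norm`, and `close` are illustrative, and `eps` defaults to zero so the arithmetic stays exact. It works through a small example, confirms that RMSNorm matches LayerNorm (without gain or bias) on zero-mean input, and shows that RMSNorm is invariant to re-scaling but not to shifting.

```python
import math


def rms_norm(a, g=None, eps=0.0):
    """Divide `a` by its root mean square, then apply the per-element gain `g`."""
    n = len(a)
    rms = math.sqrt(sum(v * v for v in a) / n + eps)
    g = g if g is not None else [1.0] * n  # gain of ones, as at initialization
    return [v / rms * gi for v, gi in zip(a, g)]


def layer_norm(a):
    """LayerNorm without gain or bias, for comparison."""
    n = len(a)
    mean = sum(a) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in a) / n)
    return [(v - mean) / std for v in a]


def close(u, v):
    """Element-wise comparison with a small tolerance for rounding."""
    return all(math.isclose(x, y, abs_tol=1e-9) for x, y in zip(u, v))


# Worked example: RMS([3, 4]) = sqrt((9 + 16) / 2) = sqrt(12.5)
print(rms_norm([3.0, 4.0]))

zero_mean = [-3.0, -1.0, 1.0, 3.0]
same_as_layer_norm = close(rms_norm(zero_mean), layer_norm(zero_mean))

a = [1.0, 2.0, 4.0, 7.0]
scale_invariant = close(rms_norm(a), rms_norm([5.0 * v for v in a]))
shift_invariant = close(rms_norm(a), rms_norm([v + 10.0 for v in a]))

print(same_as_layer_norm, scale_invariant, shift_invariant)  # True True False
```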

# Rotary Positional Embeddings (RoPE)

**What's the difference between the absolute positional encodings and the relative ones?**

1. **Absolute positional encodings** are fixed vectors that are added to the embedding of a token to represent its absolute position in the sentence, so they deal with one token at a time. You can think of them as a (latitude, longitude) pair on a map: each point on Earth has a unique pair.

$$e_{ij}=\frac{(x_i W^Q)(x_j W^K)^T}{\sqrt{d_z}}$$

2. **Relative positional encodings**, on the other hand, deal with two tokens at a time and come into play when we calculate the attention. Since the attention mechanism captures the "intensity" of how strongly two words are related to each other, relative positional encodings tell the attention mechanism the distance between the two words involved. So, given two tokens, we create a vector that represents their distance.

$$e_{ij}=\frac{x_i W^Q (x_j W^K + a_{ij}^K)^T}{\sqrt{d_z}}$$
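
The two formulas can be compared on toy numbers. In the sketch below, every matrix and vector is made up for illustration. The table `a_K` holds one relative-position vector per offset $j - i$, and clipping the offset to a maximum distance is an assumption added to keep that table finite. The same token is placed at every position, so any difference between logits comes from position alone.

```python
import math


def vec_mat(x, W):
    """Row vector times matrix: x @ W for plain Python lists."""
    return [sum(x[r] * W[r][c] for r in range(len(x))) for c in range(len(W[0]))]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Toy values, chosen only for illustration.
d_z = 2
W_Q = [[1.0, 0.0], [0.5, 1.0]]
W_K = [[0.0, 1.0], [1.0, 0.5]]
# One relative-position vector a^K per clipped offset j - i in {-1, 0, 1}.
a_K = {-1: [0.3, -0.2], 0: [0.0, 0.0], 1: [-0.1, 0.4]}


def logit_plain(x_i, x_j):
    """First formula: e_ij sees position only if it was already added to x."""
    return dot(vec_mat(x_i, W_Q), vec_mat(x_j, W_K)) / math.sqrt(d_z)


def logit_relative(x, i, j, max_dist=1):
    """Second formula: the relative term a_ij^K is added to the key."""
    offset = max(-max_dist, min(max_dist, j - i))  # clip the distance (assumption)
    key = [k + a for k, a in zip(vec_mat(x[j], W_K), a_K[offset])]
    return dot(vec_mat(x[i], W_Q), key) / math.sqrt(d_z)


# The same token at four positions, with no positional vector added.
x = [[1.0, 2.0]] * 4

order_invisible = logit_plain(x[0], x[1]) == logit_plain(x[1], x[0])
same_offset_same_logit = logit_relative(x, 0, 1) == logit_relative(x, 2, 3)
direction_matters = logit_relative(x, 0, 1) != logit_relative(x, 1, 0)

print(order_invisible, same_offset_same_logit, direction_matters)  # True True True
```

With the relative term, the logit depends on the offset between the two tokens, not on where the pair sits in the sequence.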


**What are Rotary Positional Embeddings?**

RoPE is a way to encode positional information in natural language processing models. It uses a rotation matrix to encode each token's absolute position while making the self-attention formulation depend explicitly on relative position.
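
A minimal plain-Python sketch of the rotation idea follows. It assumes the pairing and frequency schedule from the RoFormer paper: consecutive elements form 2D pairs, pair $p$ is rotated by `position * base**(-2p/d)`, and `base = 10000`. The function name `rope` and the vectors are illustrative. The check at the end shows that the dot product of a rotated query and key depends only on their relative offset.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles that grow with `position`.

    Pair p (elements 2p and 2p+1) is rotated by position * base**(-2p/d).
    """
    d = len(vec)
    out = []
    for p in range(d // 2):
        angle = position * base ** (-2.0 * p / d)
        x, y = vec[2 * p], vec[2 * p + 1]
        # Standard 2D rotation of the pair (x, y) by `angle`.
        out.append(x * math.cos(angle) - y * math.sin(angle))
        out.append(x * math.sin(angle) + y * math.cos(angle))
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


q = [1.0, 0.5, -0.3, 2.0]   # toy query vector
k = [0.2, -1.0, 0.7, 0.4]   # toy key vector

# Same relative offset (query two positions after the key), different absolute positions.
score_a = dot(rope(q, 3), rope(k, 1))
score_b = dot(rope(q, 10), rope(k, 8))
# A different offset gives a different score.
score_c = dot(rope(q, 3), rope(k, 2))

print(math.isclose(score_a, score_b, abs_tol=1e-9))  # True
print(math.isclose(score_a, score_c, abs_tol=1e-9))  # False
```

Because each rotation preserves vector length, the positional information changes only the angle between query and key, not their magnitudes.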