RoPE (Rotary Positional Encoding)
---------------------------------

The order of tokens plays a crucial role in NLP tasks such as generation and translation. Positional encoding enables transformers to take the order of words in a sequence into account. The paper "Attention Is All You Need" by Vaswani et al. proposed sinusoidal functions that give each position in a sequence a unique encoding, which is added to the token embedding.

RoPE is a form of positional encoding that is particularly useful for handling relative positions and long-range dependencies. It integrates directly with the self-attention mechanism, which helps models that need to capture complex or long-term relationships, such as those in large-scale natural language processing tasks. This has made it a common alternative to traditional sinusoidal encoding in many modern transformer architectures.

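In RoPE, positional encoding is implemented by rotating token embeddings (in practice, the query and key vectors) in the vector space, by an angle proportional to each token's position. The rotation ensures that the relative positional information between tokens is preserved: the dot product between two rotated vectors depends on how far apart the tokens are, not on their absolute positions.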

### A little math behind the rotation matrix

In the following figure, the point P is rotated counterclockwise by an angle theta (θ). Working through this rotation shows how the rotation matrix is derived.

![Alt Text](../images/rotationmatrix.jpeg)
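
As a sketch of that derivation in equations (the symbols $r$ and $\varphi$, for the length of P and its initial angle to the x-axis, are introduced here for illustration and may not match the figure's labels), write P in polar form and apply the angle-addition identities:

$$
P = (x, y) = (r\cos\varphi,\; r\sin\varphi)
$$

$$
\begin{aligned}
x' &= r\cos(\varphi + \theta) = r\cos\varphi\cos\theta - r\sin\varphi\sin\theta = x\cos\theta - y\sin\theta \\
y' &= r\sin(\varphi + \theta) = r\sin\varphi\cos\theta + r\cos\varphi\sin\theta = x\sin\theta + y\cos\theta
\end{aligned}
$$

$$
\begin{pmatrix} x' \\ y' \end{pmatrix}
=
\begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}
\begin{pmatrix} x \\ y \end{pmatrix}
$$

The 2×2 matrix in the last line is the counterclockwise rotation matrix for the angle θ.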

Rotary Positional Encoding (RoPE) is designed to incorporate positional information into embeddings in a way that preserves the relationships between tokens. The reason RoPE uses multiplication instead of addition can be understood through several key points:

##### Preservation of Relative Positional Information
* Multiplication allows the positional information to be integrated in a way that maintains the relative distance between tokens. By rotating the query and key vectors based on their positions, RoPE preserves the relationships between them in a continuous manner.

* Addition, on the other hand, would simply shift the embeddings without preserving their relative angles and magnitudes, which could lead to loss of important relational information.
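
The difference between the two bullets above can be checked numerically. The following is a minimal sketch, not the RoFormer implementation: the toy vectors `q` and `k`, the helper names, and the additive positional vector `(sin(pos), cos(pos))` are all assumptions chosen for illustration. It compares scores for two position pairs that share the same offset.

```python
import numpy as np

def rotation(angle):
    """2D counterclockwise rotation matrix for an angle in radians."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

def additive_position(pos):
    """Toy additive positional vector (hypothetical, for comparison only)."""
    return np.array([np.sin(pos), np.cos(pos)])

q = np.array([1.0, 0.0])  # toy query vector
k = np.array([0.0, 1.0])  # toy key vector

def rotary_score(m, n):
    """Dot product after rotating q by m radians and k by n radians."""
    return (rotation(m) @ q) @ (rotation(n) @ k)

def additive_score(m, n):
    """Dot product after adding a positional vector to q and to k."""
    return (q + additive_position(m)) @ (k + additive_position(n))

# Two position pairs with the same offset (n - m = 2).
print(rotary_score(1, 3), rotary_score(5, 7))      # same value up to rounding
print(additive_score(1, 3), additive_score(5, 7))  # values differ
```

With rotation, the score is determined by the offset alone; with this additive scheme, it also changes with the absolute positions.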

##### Geometric Interpretation
* The rotation (via multiplication) can be viewed as a geometric transformation in which the vector's direction is changed without altering its length. This is particularly useful in maintaining the attention mechanism’s sensitivity to the positions of tokens.

* In contrast, adding a positional encoding directly to the embeddings shifts each vector by a position-dependent offset, which changes both its direction and its length. This can distort the relationships between embeddings and complicate the attention mechanism.

##### Use of Orthogonal Transformations
* RoPE employs rotation matrices that are orthogonal. When you multiply a vector by an orthogonal matrix (like a rotation matrix), the result preserves the length of the vector, maintaining the magnitude and relative positioning of embeddings.

* This property ensures that the embeddings retain their original information while also incorporating positional context effectively.
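
A small numerical check of these properties, as a sketch with an arbitrary angle and toy vectors (all values here are illustrative assumptions):

```python
import numpy as np

theta = 0.7  # arbitrary rotation angle in radians
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

x = np.array([3.0, 4.0])   # toy embedding
y = np.array([-1.0, 2.0])  # second toy embedding

# Orthogonality: the transpose of R is its inverse.
print(np.allclose(R.T @ R, np.eye(2)))

# The length of a vector is the same before and after rotation.
print(np.linalg.norm(x), np.linalg.norm(R @ x))

# Rotating two vectors by the same matrix leaves their dot product unchanged.
print(x @ y, (R @ x) @ (R @ y))
```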

##### Compatibility with Attention Mechanism
* The self-attention mechanism relies heavily on dot products between query and key vectors. When RoPE applies rotation (multiplication), it maintains the mathematical properties needed for effective attention scoring.

* Using addition would change the distribution of the embeddings, potentially leading to suboptimal attention scores and reducing the model's overall effectiveness.
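
The two cases can be written out for a single two-dimensional query $q$ at position $m$ and key $k$ at position $n$. The notation is assumed here for illustration: $R_m$ is the rotation by the angle $m\theta$, and $p_m$ is an additive positional vector applied directly to the query and key, which simplifies how additive encodings are used in practice.

$$
(R_m q)^\top (R_n k) = q^\top R_m^\top R_n\, k = q^\top R_{n-m}\, k
$$

This holds because $R_m^\top = R_{-m}$ and $R_a R_b = R_{a+b}$, so the score depends on the positions only through the offset $n - m$. For addition:

$$
(q + p_m)^\top (k + p_n) = q^\top k + q^\top p_n + p_m^\top k + p_m^\top p_n
$$

Here the cross terms $q^\top p_n$ and $p_m^\top k$ depend on the absolute positions $n$ and $m$.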

### RoPE implementation (source: RoFormer paper)

![Alt Text](../images/ropeencoding.jpeg)
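
For reference, the RoFormer paper gives an elementwise form of the rotation for a $d$-dimensional vector $x$ at position $m$. It is transcribed below as a sketch; the paper has the exact notation.

$$
R^d_{\Theta,m}\,x =
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ \vdots \\ x_{d-1} \\ x_d \end{pmatrix}
\otimes
\begin{pmatrix} \cos m\theta_1 \\ \cos m\theta_1 \\ \cos m\theta_2 \\ \cos m\theta_2 \\ \vdots \\ \cos m\theta_{d/2} \\ \cos m\theta_{d/2} \end{pmatrix}
+
\begin{pmatrix} -x_2 \\ x_1 \\ -x_4 \\ x_3 \\ \vdots \\ -x_d \\ x_{d-1} \end{pmatrix}
\otimes
\begin{pmatrix} \sin m\theta_1 \\ \sin m\theta_1 \\ \sin m\theta_2 \\ \sin m\theta_2 \\ \vdots \\ \sin m\theta_{d/2} \\ \sin m\theta_{d/2} \end{pmatrix},
\qquad
\theta_i = 10000^{-2(i-1)/d},\quad i = 1, \dots, d/2
$$

Each consecutive pair of features $(x_{2i-1}, x_{2i})$ is rotated by the angle $m\theta_i$. For $d = 2$ there is a single pair with $\theta_1 = 1$, so the vector at position $m$ is rotated by $m$ radians.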

```python
# This code is for illustration only; for simplicity it uses 2 embedding dimensions.

import numpy as np
from common import *  # helper functions used below (get_encoding, get_rope_encoding, ...)

text = "there is a cat there is a cat"
n_embed = 2  # number of embedding dimensions
tokens = text.lower().split()
seq_len = len(tokens)
vocab = sorted(set(tokens))
vocab_size = len(vocab)

np.random.seed(42)
# Note: this first matrix is overwritten by the next line; the call still advances the random state.
embedding_matrix = np.random.rand(vocab_size, n_embed).round(3)
embedding_matrix = np.random.uniform(low=0.0, high=1.0, size=(vocab_size, n_embed))
tok2pos = {tok: i for i, tok in enumerate(vocab)}   # token -> index in vocab
pos2token = {i: tok for tok, i in tok2pos.items()}  # index in vocab -> token
```

```python
encoding = get_encoding(embedding_matrix, tokens, tok2pos)
seq_encoding = np.array([encoding[token] for token in tokens]).round(3)  # embeddings of the sequence, in token order
rope_encoding = get_rope_encoding(seq_encoding, n_embed, seq_len)  # apply RoPE to the sequence
out = rope_encoding.reshape(seq_len, n_embed)  # reshape to (seq_len, n_embed)
```
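
The `common` module is not shown here. To run these cells without it, the following hypothetical stand-ins could be defined first. The names and call signatures are taken from how the functions are called in this section, but the bodies are assumptions, not the author's code. The `base` parameter is also an addition for illustration, with its default taken from the RoFormer choice of 10000. The sketch assumes `n_embed` is even.

```python
import numpy as np

def get_encoding(embedding_matrix, tokens, tok2pos):
    """Assumed behaviour: map each token to its row of the embedding matrix."""
    return {tok: embedding_matrix[tok2pos[tok]] for tok in tokens}

def get_rope_encoding(seq_encoding, n_embed, seq_len, base=10000.0):
    """Assumed behaviour: rotate each (even, odd) feature pair of the token at
    position m by the angle m * theta_i, with theta_i = base ** (-2i / n_embed)."""
    x = np.asarray(seq_encoding, dtype=float).reshape(seq_len, n_embed)
    theta = base ** (-2.0 * np.arange(n_embed // 2) / n_embed)  # one angle per pair
    angles = np.arange(seq_len)[:, None] * theta[None, :]       # (seq_len, n_embed // 2)
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x[:, 0::2] * cos - x[:, 1::2] * sin  # first coordinate of each rotated pair
    out[:, 1::2] = x[:, 0::2] * sin + x[:, 1::2] * cos  # second coordinate of each rotated pair
    return out

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))
```

With `n_embed = 2` there is a single feature pair and `theta` is 1, so in this sketch the token at position `m` is rotated by `m` radians and the token at position 0 is left unchanged.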

```python
token1 = out[0]  # "there" (position 0)
token2 = out[3]  # "cat" (position 3)
cs_rope = cosine_similarity(token1, token2)                      # cosine similarity between "there" and "cat" after RoPE
cs_embed = cosine_similarity(seq_encoding[0], seq_encoding[3])   # cosine similarity between "there" and "cat" before RoPE
ed_rope = euclidean_distance(token1, token2)                     # Euclidean distance between "there" and "cat" after RoPE
ed_embed = euclidean_distance(seq_encoding[0], seq_encoding[3])  # Euclidean distance between "there" and "cat" before RoPE

print(cs_rope, cs_embed)
print(ed_rope, ed_embed)
```

```
-0.8142155015563441 0.7241382159408638
1.1898447977996685 0.8032994460349141
```
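
The printed cosine similarities can be sanity-checked by hand. The sketch below assumes that, with 2 embedding dimensions, the token at position `m` is rotated by `m` radians (θ = 1, as in the RoFormer formula for the first feature pair); the helper code is not shown, so this is an inference from the numbers rather than a statement about the implementation. Under that assumption "there" (position 0) stays in place and "cat" (position 3) is turned by 3 radians, so the angle between them grows by 3 radians.

```python
import math

cs_embed = 0.7241382159408638  # printed cosine similarity before RoPE
cs_rope = -0.8142155015563441  # printed cosine similarity after RoPE

angle_before = math.acos(cs_embed)  # angle between "there" and "cat", in radians
angle_after = angle_before + 3.0    # assumed: "cat" is rotated 3 radians further than "there"

# If the assumption holds, the two printed numbers should agree closely.
print(math.cos(angle_after), cs_rope)
```

The Euclidean distance changes for the same reason: the two vectors keep their lengths, but the angle between them is different.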


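Rotary Positional Encoding (RoPE) rotates each token's vector by an angle that depends on its position. As a result, the distances and similarities between the rotated vectors differ from those of the original embeddings, as the output above shows for "there" (position 0) and "cat" (position 3). Each vector keeps its length, and the dot product between two rotated vectors depends only on their relative offset, which is the property attention-based models rely on.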


#### References and further reading

* RoFormer: Enhanced Transformer with Rotary Position Embedding https://arxiv.org/abs/2104.09864
* https://github.com/adalkiran/llama-nuts-and-bolts/blob/main/docs/10-ROPE-ROTARY-POSITIONAL-EMBEDDINGS.md
* PyTorch (torchtune) implementation - https://pytorch.org/torchtune/stable/_modules/torchtune/modules/position_embeddings.html#RotaryPositionalEmbeddings