# Sinusoidal Position Encoding

In [None]:
import torch

## How does the code work?

In [2]:
max_len = 8
d_model = 20
base = 10000

The first step is to create a column vector holding the position index of each token:

In [3]:
position = torch.arange(0, max_len).unsqueeze(1).float()
position

tensor([[0.],
        [1.],
        [2.],
        [3.],
        [4.],
        [5.],
        [6.],
        [7.]])

Next, we create a vector containing the index of each embedding dimension:

In [4]:
torch.arange(0, d_model).float()

tensor([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11., 12., 13.,
        14., 15., 16., 17., 18., 19.])

Plugging these indices into the division term gives one rate constant per dimension, computed in a single vectorized expression:

In [5]:
div_term = base ** (-torch.arange(0, d_model).float() / d_model)
div_term

tensor([1.0000e+00, 6.3096e-01, 3.9811e-01, 2.5119e-01, 1.5849e-01, 1.0000e-01,
        6.3096e-02, 3.9811e-02, 2.5119e-02, 1.5849e-02, 1.0000e-02, 6.3096e-03,
        3.9811e-03, 2.5119e-03, 1.5849e-03, 1.0000e-03, 6.3096e-04, 3.9811e-04,
        2.5119e-04, 1.5849e-04])
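
As a quick cross-check, the same constants can be reproduced with only the standard library. This is a minimal sketch rather than the notebook's code: the names `rates` and `ratio` are introduced here for illustration, and it assumes the same `d_model = 20` and `base = 10000` as above. It also makes visible that consecutive entries differ by a constant factor, so the rates form a geometric progression.

```python
# Dependency-free re-computation of div_term (assumes d_model=20, base=10000 as above).
d_model = 20
base = 10000

# One rate per embedding dimension j: base ** (-j / d_model)
rates = [base ** (-j / d_model) for j in range(d_model)]

# Consecutive rates share a constant ratio, base ** (-1 / d_model)
ratio = rates[1] / rates[0]

print([f"{r:.4e}" for r in rates[:4]])
print(f"common ratio: {ratio:.5f}")
```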

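Multiplying the position column by these constants broadcasts a `(max_len, 1)` tensor against a `(d_model,)` tensor, giving the `(max_len, d_model)` tensor of angles for the sin and cos functions to operate on: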

In [6]:
angle_rates = position * div_term 
angle_rates

tensor([[0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
         0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
         0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
         0.0000e+00, 0.0000e+00],
        [1.0000e+00, 6.3096e-01, 3.9811e-01, 2.5119e-01, 1.5849e-01, 1.0000e-01,
         6.3096e-02, 3.9811e-02, 2.5119e-02, 1.5849e-02, 1.0000e-02, 6.3096e-03,
         3.9811e-03, 2.5119e-03, 1.5849e-03, 1.0000e-03, 6.3096e-04, 3.9811e-04,
         2.5119e-04, 1.5849e-04],
        [2.0000e+00, 1.2619e+00, 7.9621e-01, 5.0238e-01, 3.1698e-01, 2.0000e-01,
         1.2619e-01, 7.9621e-02, 5.0238e-02, 3.1698e-02, 2.0000e-02, 1.2619e-02,
         7.9621e-03, 5.0238e-03, 3.1698e-03, 2.0000e-03, 1.2619e-03, 7.9621e-04,
         5.0238e-04, 3.1698e-04],
        [3.0000e+00, 1.8929e+00, 1.1943e+00, 7.5357e-01, 4.7547e-01, 3.0000e-01,
         1.8929e-01, 1.1943e-01, 7.5357e-02, 4.7547e-02, 3.0000e-02, 1.8929e-02,
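
The multiplication above relies on broadcasting: a column of positions times a row of rates yields a full grid, which is an outer product. A minimal pure-Python sketch of the same idea follows; the helper name `outer` is hypothetical, and the small inputs are illustrative stand-ins chosen only to keep the output readable.

```python
def outer(column, row):
    """Return the outer product: result[p][j] = column[p] * row[j]."""
    return [[c * r for r in row] for c in column]

positions = [0.0, 1.0, 2.0, 3.0]   # plays the role of `position`
rates = [1.0, 0.5, 0.25]           # illustrative stand-in for `div_term`

angles = outer(positions, rates)   # shape: len(positions) x len(rates)
for row in angles:
    print(row)
```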
       

We initialize a matrix of zeros with the target shape `(max_len, d_model)`:

In [7]:
pe = torch.zeros(max_len, d_model)

Before we fill it in, let's inspect the expression `angle_rates[:, 0::2]`. It selects every second column of `angle_rates`, starting at index 0. Slicing this way lets us interleave the sin and cos values easily.

In [8]:
angle_rates[:, 0::2]

tensor([[0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
         0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00],
        [1.0000e+00, 3.9811e-01, 1.5849e-01, 6.3096e-02, 2.5119e-02, 1.0000e-02,
         3.9811e-03, 1.5849e-03, 6.3096e-04, 2.5119e-04],
        [2.0000e+00, 7.9621e-01, 3.1698e-01, 1.2619e-01, 5.0238e-02, 2.0000e-02,
         7.9621e-03, 3.1698e-03, 1.2619e-03, 5.0238e-04],
        [3.0000e+00, 1.1943e+00, 4.7547e-01, 1.8929e-01, 7.5357e-02, 3.0000e-02,
         1.1943e-02, 4.7547e-03, 1.8929e-03, 7.5357e-04],
        [4.0000e+00, 1.5924e+00, 6.3396e-01, 2.5238e-01, 1.0048e-01, 4.0000e-02,
         1.5924e-02, 6.3396e-03, 2.5238e-03, 1.0048e-03],
        [5.0000e+00, 1.9905e+00, 7.9245e-01, 3.1548e-01, 1.2559e-01, 5.0000e-02,
         1.9905e-02, 7.9245e-03, 3.1548e-03, 1.2559e-03],
        [6.0000e+00, 2.3886e+00, 9.5094e-01, 3.7857e-01, 1.5071e-01, 6.0000e-02,
         2.3886e-02, 9.5094e-03, 3.7857e-03, 1.5071e-03],
        [7.0000e+00, 2.7868
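
The extended slice `start::step` behaves the same way on a plain Python list, which may make the column selection easier to see. The sketch below is illustrative only (the names `columns`, `even`, `odd`, and `merged` are not from the notebook): `0::2` picks the even indices, `1::2` the odd ones, and assigning through those slices interleaves two sequences.

```python
columns = list(range(6))        # stand-in for the column indices 0..5

even = columns[0::2]            # every second index starting at 0
odd = columns[1::2]             # every second index starting at 1

# Slice assignment interleaves two sequences, much like writing
# into pe[:, 0::2] and pe[:, 1::2] does for the columns of pe
merged = [None] * 6
merged[0::2] = ["sin"] * 3
merged[1::2] = ["cos"] * 3

print(even, odd)
print(merged)
```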

We now write the sine of the even-indexed angle columns into `pe`. The odd-indexed columns stay zero until the cosine step that follows.

In [9]:
pe[:, 0::2] = torch.sin(angle_rates[:, 0::2])
pe

tensor([[ 0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00],
        [ 8.4147e-01,  0.0000e+00,  3.8767e-01,  0.0000e+00,  1.5783e-01,
          0.0000e+00,  6.3054e-02,  0.0000e+00,  2.5116e-02,  0.0000e+00,
          9.9998e-03,  0.0000e+00,  3.9811e-03,  0.0000e+00,  1.5849e-03,
          0.0000e+00,  6.3096e-04,  0.0000e+00,  2.5119e-04,  0.0000e+00],
        [ 9.0930e-01,  0.0000e+00,  7.1471e-01,  0.0000e+00,  3.1170e-01,
          0.0000e+00,  1.2586e-01,  0.0000e+00,  5.0217e-02,  0.0000e+00,
          1.9999e-02,  0.0000e+00,  7.9621e-03,  0.0000e+00,  3.1698e-03,
          0.0000e+00,  1.2619e-03,  0.0000e+00,  5.0238e-04,  0.0000e+00],
        [ 1.4112e-01,  0.0000e+00,  9.2997e-01,  0.0000e+00,  4.5775e-01,
          0.0000e+00,  1.8816e-01, 

In [10]:
angle_rates[:, 1::2]

tensor([[0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
         0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00],
        [6.3096e-01, 2.5119e-01, 1.0000e-01, 3.9811e-02, 1.5849e-02, 6.3096e-03,
         2.5119e-03, 1.0000e-03, 3.9811e-04, 1.5849e-04],
        [1.2619e+00, 5.0238e-01, 2.0000e-01, 7.9621e-02, 3.1698e-02, 1.2619e-02,
         5.0238e-03, 2.0000e-03, 7.9621e-04, 3.1698e-04],
        [1.8929e+00, 7.5357e-01, 3.0000e-01, 1.1943e-01, 4.7547e-02, 1.8929e-02,
         7.5357e-03, 3.0000e-03, 1.1943e-03, 4.7547e-04],
        [2.5238e+00, 1.0048e+00, 4.0000e-01, 1.5924e-01, 6.3396e-02, 2.5238e-02,
         1.0048e-02, 4.0000e-03, 1.5924e-03, 6.3396e-04],
        [3.1548e+00, 1.2559e+00, 5.0000e-01, 1.9905e-01, 7.9245e-02, 3.1548e-02,
         1.2559e-02, 5.0000e-03, 1.9905e-03, 7.9245e-04],
        [3.7857e+00, 1.5071e+00, 6.0000e-01, 2.3886e-01, 9.5094e-02, 3.7857e-02,
         1.5071e-02, 6.0000e-03, 2.3886e-03, 9.5094e-04],
        [4.4167e+00, 1.7583

In [11]:
pe[:, 1::2] = torch.cos(angle_rates[:, 1::2])
pe

tensor([[ 0.0000e+00,  1.0000e+00,  0.0000e+00,  1.0000e+00,  0.0000e+00,
          1.0000e+00,  0.0000e+00,  1.0000e+00,  0.0000e+00,  1.0000e+00,
          0.0000e+00,  1.0000e+00,  0.0000e+00,  1.0000e+00,  0.0000e+00,
          1.0000e+00,  0.0000e+00,  1.0000e+00,  0.0000e+00,  1.0000e+00],
        [ 8.4147e-01,  8.0746e-01,  3.8767e-01,  9.6862e-01,  1.5783e-01,
          9.9500e-01,  6.3054e-02,  9.9921e-01,  2.5116e-02,  9.9987e-01,
          9.9998e-03,  9.9998e-01,  3.9811e-03,  1.0000e+00,  1.5849e-03,
          1.0000e+00,  6.3096e-04,  1.0000e+00,  2.5119e-04,  1.0000e+00],
        [ 9.0930e-01,  3.0399e-01,  7.1471e-01,  8.7644e-01,  3.1170e-01,
          9.8007e-01,  1.2586e-01,  9.9683e-01,  5.0217e-02,  9.9950e-01,
          1.9999e-02,  9.9992e-01,  7.9621e-03,  9.9999e-01,  3.1698e-03,
          1.0000e+00,  1.2619e-03,  1.0000e+00,  5.0238e-04,  1.0000e+00],
        [ 1.4112e-01, -3.1654e-01,  9.2997e-01,  7.2925e-01,  4.5775e-01,
          9.5534e-01,  1.8816e-01, 

# Concrete Example (Code)

In [17]:
from sinusoidal import SinusoidalPositionalEncoding

spe = SinusoidalPositionalEncoding(6)

In [18]:
encoding = spe(torch.zeros(1,12,6))
encoding

tensor([[[ 0.0000,  1.0000,  0.0000,  1.0000,  0.0000,  1.0000],
         [ 0.8415,  0.9769,  0.0464,  0.9999,  0.0022,  1.0000],
         [ 0.9093,  0.9086,  0.0927,  0.9998,  0.0043,  1.0000],
         [ 0.1411,  0.7983,  0.1388,  0.9996,  0.0065,  1.0000],
         [-0.7568,  0.6511,  0.1846,  0.9992,  0.0086,  1.0000],
         [-0.9589,  0.4738,  0.2300,  0.9988,  0.0108,  1.0000],
         [-0.2794,  0.2746,  0.2749,  0.9982,  0.0129,  1.0000],
         [ 0.6570,  0.0627,  0.3192,  0.9976,  0.0151,  1.0000],
         [ 0.9894, -0.1522,  0.3629,  0.9968,  0.0172,  1.0000],
         [ 0.4121, -0.3599,  0.4057,  0.9960,  0.0194,  1.0000],
         [-0.5440, -0.5511,  0.4477,  0.9950,  0.0215,  1.0000],
         [-1.0000, -0.7167,  0.4887,  0.9940,  0.0237,  1.0000]]])

In [20]:
torch.dot(encoding[0,3], encoding[0,4]), torch.dot(encoding[0,3], encoding[0,10])

(tensor(2.4374), tensor(1.5401))
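
The `sinusoidal` module is not shown in this notebook, so the following is only a guess at what it computes, written without PyTorch. The function `encode(seq_len, d_model)` is a hypothetical stand-in that applies the same per-dimension rate as the step-by-step code above. Under that assumption it should reproduce the two dot products printed above to four decimals.

```python
import math

def encode(seq_len, d_model, base=10000):
    """Hypothetical stand-in: sin on even dimensions, cos on odd ones,
    each dimension j using its own rate base ** (-j / d_model)."""
    table = []
    for pos in range(seq_len):
        row = []
        for j in range(d_model):
            angle = pos * base ** (-j / d_model)
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def dot(u, v):
    """Plain inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

pe = encode(12, 6)
near = dot(pe[3], pe[4])    # positions one step apart
far = dot(pe[3], pe[10])    # positions seven steps apart
print(f"{near:.4f} {far:.4f}")
```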

# Sinusoidal Positional Encoding – Concrete Example

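This example demonstrates how sinusoidal positional encoding works using **manual values** for position and dimension. It is based on the formula from *Attention Is All You Need*:

$$
\text{PE}_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \quad
\text{PE}_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
$$

Note that the code in this notebook is a slight variant of that formula: every dimension $j$ gets its own rate $10000^{-j/d_{\text{model}}}$, with $\sin$ on even $j$ and $\cos$ on odd $j$, so the sine and cosine of a pair do not share a frequency. The worked example below follows the code.

---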

## 📌 Parameters

- $\text{pos} = 3$
- $d_{\text{model}} = 6$ (total embedding dimensions)
- $i \in \{0, 1, 2, 3, 4, 5\}$ (the index of each individual dimension, not the pair index used in the formula)

We compute the encoding values for each dimension of position 3.

---

## 🧮 Step-by-Step Calculation for PE(3)

| Dimension $i$ | Type | Exponent $i/d_{\text{model}}$ | Rate $10000^{-\text{exp}}$ | Angle $3 \times \text{Rate}$ | $\text{PE}(3)_i$ |
|---------------|------|-------------------------------|----------------------------|------------------------------|------------------|
| 0             | sin  | 0.000                         | 1.000000                   | 3.000000                     | 0.1411           |
| 1             | cos  | 0.167                         | 0.215443                   | 0.646330                     | 0.7983           |
| 2             | sin  | 0.333                         | 0.046416                   | 0.139248                     | 0.1388           |
| 3             | cos  | 0.500                         | 0.010000                   | 0.030000                     | 0.9996           |
| 4             | sin  | 0.667                         | 0.002154                   | 0.006463                     | 0.0065           |
| 5             | cos  | 0.833                         | 0.000464                   | 0.001392                     | 1.0000           |
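
The rows can be recomputed with a few lines of Python. This sketch assumes, as in the step-by-step code earlier, that dimension $i$ uses the rate $10000^{-i/d_{\text{model}}}$ with sine on even and cosine on odd dimensions. The variable names are illustrative. Its last printed column should agree with the $\text{PE}(3)$ values.

```python
import math

pos, d_model, base = 3, 6, 10000

rows = []
for i in range(d_model):
    exponent = i / d_model                 # i / d_model
    rate = base ** (-exponent)             # 10000 ** (-exponent)
    angle = pos * rate                     # argument passed to sin or cos
    kind = "sin" if i % 2 == 0 else "cos"  # even -> sin, odd -> cos
    value = math.sin(angle) if kind == "sin" else math.cos(angle)
    rows.append((i, kind, exponent, rate, angle, value))

for i, kind, exponent, rate, angle, value in rows:
    print(f"{i} {kind} {exponent:.3f} {rate:.6f} {angle:.6f} {value:.4f}")
```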

---

## ✅ Final Embedding Vector for Position 3


$\Rightarrow \text{PE}(3) = [0.1411,\ 0.7983,\ 0.1388,\ 0.9996,\ 0.0065,\ 1.0000]$

---

## 💡 Interpretation

Each dimension is a sinusoid with its own wavelength. Lower dimensions oscillate quickly (short wavelength, $2\pi$ positions for dimension 0), while higher dimensions vary slowly (long wavelength). Together these patterns let the model tell positions apart and make the inner product of two encodings sensitive to how far apart they are.
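
To make the wavelengths concrete, each dimension's period, measured in positions, is $2\pi$ divided by its rate. The sketch below assumes the per-dimension rate $10000^{-i/d_{\text{model}}}$ used by the code in this notebook. The variable names are illustrative.

```python
import math

d_model, base = 6, 10000

wavelengths = []
for i in range(d_model):
    rate = base ** (-i / d_model)        # angle advanced per position step
    wavelengths.append(2 * math.pi / rate)  # positions per full cycle

for i, w in enumerate(wavelengths):
    print(f"dimension {i}: wavelength ≈ {w:.1f} positions")
```

The wavelength grows with the dimension index, so the first dimensions separate neighbouring positions while the last ones change only over very long spans.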

---

## 🔢 Elementwise Dot Product: PE(3) · PE(4)

We manually compute:

$\text{PE}(4) = [-0.7568,\ 0.6511,\ 0.1846,\ 0.9992,\ 0.0086,\ 1.0000]$

$\text{PE}(3) \cdot \text{PE}(4) = \sum_{i=0}^{5} \text{PE}(3)_i \cdot \text{PE}(4)_i$

| $i$ | $\text{PE}(3)_i$ | $\text{PE}(4)_i$ | Term |
|--------:|------------:|------------:|------:|
| 0       | 0.1411      | -0.7568     | -0.1068 |
| 1       | 0.7983      |  0.6511     |  0.5198 |
| 2       | 0.1388      |  0.1846     |  0.0256 |
| 3       | 0.9996      |  0.9992     |  0.9988 |
| 4       | 0.0065      |  0.0086     |  0.0001 |
| 5       | 1.0000      |  1.0000     |  1.0000 |

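### ✅ Final Result:

$$
\text{PE}(3) \cdot \text{PE}(4) = 
-0.1068 + 0.5198 + 0.0256 + 0.9988 + 0.0001 + 1.0000 = \boxed{2.4375}
$$

The sum of the rounded terms differs in the last digit from the `torch.dot` result of 2.4374 shown earlier, which uses the unrounded values.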

---

## 🔢 Elementwise Dot Product: PE(3) · PE(10)

$\text{PE}(10) = [-0.5440,\ -0.5511,\ 0.4477,\ 0.9950,\ 0.0215,\ 1.0000]$

$\text{PE}(3) \cdot \text{PE}(10) = \sum_{i=0}^{5} \text{PE}(3)_i \cdot \text{PE}(10)_i$

| $i$ | $\text{PE}(3)_i$ | $\text{PE}(10)_i$ | Term |
|--------:|------------:|-------------:|------:|
| 0       | 0.1411      | -0.5440      | -0.0768 |
| 1       | 0.7983      | -0.5511      | -0.4399 |
| 2       | 0.1388      |  0.4477      |  0.0621 |
| 3       | 0.9996      |  0.9950      |  0.9946 |
| 4       | 0.0065      |  0.0215      |  0.0001 |
| 5       | 1.0000      |  1.0000      |  1.0000 |

### ✅ Final Result:

$$
\text{PE}(3) \cdot \text{PE}(10) = 
-0.0768 - 0.4399 + 0.0621 + 0.9946 + 0.0001 + 1.0000 = \boxed{1.5401}
$$

---

## ✅ Why This Matters

- **PE(3) · PE(4) = 2.4375** → high alignment between close positions
- **PE(3) · PE(10) = 1.5401** → much lower alignment for distant positions

In this example the dot product falls as the positions move farther apart, because the sinusoidal components drift out of phase. The trend is not strictly monotonic, since every term is itself periodic, but it gives Transformers some sensitivity to relative distance even though the encodings are absolute, **without requiring learned embeddings**.

- Sinusoidal encoding is **parameter-free** and can be evaluated at any position, including positions **beyond the lengths seen in training**.
- This makes it a natural candidate for models that need to generalize beyond their training context length.
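
For the paired formula quoted from the paper, where each sine and cosine pair shares one frequency $\omega_i$, the identity $\sin a \sin b + \cos a \cos b = \cos(a - b)$ gives $\text{PE}(p) \cdot \text{PE}(q) = \sum_i \cos(\omega_i (p - q))$, so the dot product depends only on the offset $p - q$. The sketch below checks this numerically. `paired_encode` is a hypothetical helper written for this illustration, and it deliberately follows the paper's formula rather than the per-dimension rates used in the notebook code.

```python
import math

def paired_encode(pos, d_model, base=10000):
    """Hypothetical helper following the paper's formula: dimensions
    2i and 2i+1 share the frequency base ** (-2i / d_model)."""
    row = []
    for i in range(d_model // 2):
        omega = base ** (-2 * i / d_model)
        row.append(math.sin(pos * omega))
        row.append(math.cos(pos * omega))
    return row

def dot(u, v):
    """Plain inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

d_model = 6
offset = 1

# The same offset at different absolute positions gives the same dot product
same_offset = [
    dot(paired_encode(p, d_model), paired_encode(p + offset, d_model))
    for p in (0, 3, 7)
]

# Closed form: sum over the shared frequencies of cos(omega * offset)
omegas = [10000 ** (-2 * i / d_model) for i in range(d_model // 2)]
closed_form = sum(math.cos(w * offset) for w in omegas)

print(same_offset, closed_form)
```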

---