# Implement Sinusoidal Positional Embeddings from Scratch
### Problem Statement
Self-attention is order-agnostic: without positional information, a Transformer sees tokens like a goldfish, with no sense of sequence. To inject **position awareness** into the model, we use **Sinusoidal Positional Embeddings**, where each position in the sequence gets a unique, deterministic vector. These vectors are computed from sine and cosine waves at different frequencies, so no parameters need to be learned.

Your task is to implement the sinusoidal position encoding mechanism from scratch using PyTorch — no cheating with built-ins from `fairseq` or Hugging Face.
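
For reference, the encoding this exercise asks for is usually written as below. This is the standard formulation from the original Transformer architecture, restated here under the assumption that it is what the requirements intend: $pos$ is the position index, $i$ indexes a pair of embedding columns, and $\omega_i$ is the frequency of that pair.

$$
PE_{(pos,\,2i)} = \sin\left(pos \cdot \omega_i\right), \qquad
PE_{(pos,\,2i+1)} = \cos\left(pos \cdot \omega_i\right), \qquad
\omega_i = 10000^{-2i/d_{\text{model}}}
$$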

---

### Requirements

1. **Define the Sinusoidal Embedding Class**
   - Implement a `SinusoidalPositionalEmbedding` class inheriting from `nn.Module`.
   - Initialize with `max_seq_len` and `d_model`.
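   - Create a tensor `pe` of shape `(max_seq_len, d_model)` filled with sine and cosine encodings, where `ω_i = 10000^(-2i / d_model)` is the frequency shared by the column pair `(2i, 2i + 1)`:
     - `sin(position * ω_i)` for even column indices `2i`
     - `cos(position * ω_i)` for odd column indices `2i + 1`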

2. **Register as Buffer**
   - Use `self.register_buffer("pe", pe)` to store `pe` without treating it as a trainable parameter.

3. **Generate Encodings**
   - On calling `forward(x)`, return the first `seq_len` rows of `pe`, where `seq_len = x.shape[1]`, with a leading batch dimension of size 1.

4. **Test the Embeddings**
   - Initialize the embedding class with `max_seq_len = 100` and `d_model = 64`.
   - Pass a sequence of length 50 to verify the returned shape is `(1, 50, 64)`.
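
Requirement 2 hinges on the difference between a buffer and a parameter. The sketch below illustrates that difference on a toy module; `BufferDemo`, `weight`, and `table` are hypothetical names used only for this illustration and are not part of the exercise.

```python
import torch
import torch.nn as nn


class BufferDemo(nn.Module):
    """Hypothetical toy module contrasting a parameter with a buffer."""

    def __init__(self):
        super().__init__()
        # Trainable: returned by parameters() and updated by optimizers.
        self.weight = nn.Parameter(torch.zeros(3))
        # Not trainable: excluded from parameters(), but saved in state_dict()
        # and moved together with the module by .to(device).
        self.register_buffer("table", torch.ones(3))


demo = BufferDemo()
param_names = [name for name, _ in demo.named_parameters()]
buffer_names = [name for name, _ in demo.named_buffers()]

print(param_names)                   # ['weight']
print(buffer_names)                  # ['table']
print("table" in demo.state_dict())  # True
print(demo.table.requires_grad)      # False
```

The same checks can be run against `pe` once the embedding class is written.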

---

### Constraints

- ✅ Implement the encoding yourself; do not use Hugging Face, Fairseq, or any other prebuilt positional-encoding module.
- ✅ Ensure the `pe` tensor is not a trainable parameter.
- ✅ Support any sequence length up to `max_seq_len`.
- ❌ Do not inject these embeddings directly into token embeddings yet — this is just the embedding module.
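
One detail worth keeping in mind for the sequence-length constraint: plain slicing past the end of a tensor does not raise, it just returns fewer rows. A minimal sketch of an explicit guard is shown below. The helper `slice_positions` and the placeholder `table` are hypothetical and exist only for this illustration; whether to raise or to handle longer inputs some other way is a design choice the exercise leaves open.

```python
import torch


def slice_positions(pe: torch.Tensor, seq_len: int) -> torch.Tensor:
    """Hypothetical helper: return the first seq_len rows with a batch axis."""
    max_seq_len = pe.shape[0]
    if seq_len > max_seq_len:
        # Plain slicing would silently return only max_seq_len rows here.
        raise ValueError(f"seq_len {seq_len} exceeds max_seq_len {max_seq_len}")
    return pe[:seq_len].unsqueeze(0)  # (1, seq_len, d_model)


table = torch.zeros(100, 64)  # placeholder table; the values do not matter here
print(slice_positions(table, 50).shape)  # torch.Size([1, 50, 64])
```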

---

<details>
  <summary>💡 Hint</summary>

  - Use `torch.arange(0, max_seq_len).unsqueeze(1)` to create position indices.
  - Compute frequencies with `torch.exp(torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model))`.
  - Alternate `sin` and `cos` values for even and odd embedding dimensions.
  - When returning the embedding in `forward`, use `.unsqueeze(0)` to add a leading batch dimension of size 1, so the result broadcasts across the batch.

</details>
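
The shapes in the hint can be checked in isolation before writing the class. The sketch below uses small illustrative sizes (`max_seq_len = 6`, `d_model = 8`, chosen only for this example) and also confirms that the `exp`/`log` form of the frequencies matches the direct power form `10000 ** (-2i / d_model)`.

```python
import math

import torch

max_seq_len, d_model = 6, 8  # small illustrative sizes (assumed for this sketch)

position = torch.arange(0, max_seq_len, dtype=torch.float).unsqueeze(1)  # (6, 1)
even_dims = torch.arange(0, d_model, 2, dtype=torch.float)               # (4,)

# exp(-ln(10000) * 2i / d_model) is the same quantity as 10000 ** (-2i / d_model)
div_term = torch.exp(even_dims * -(math.log(10000.0) / d_model))
direct = 10000.0 ** (-even_dims / d_model)

# (6, 1) * (4,) broadcasts to (6, 4): one angle per (position, frequency) pair
angles = position * div_term

print(angles.shape)                      # torch.Size([6, 4])
print(torch.allclose(div_term, direct))  # True
```

Each column of `angles` feeds one sine column and one cosine column of `pe`, which is why the table has `d_model` columns while `div_term` has only `d_model / 2` entries.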

In [3]:
import math

import torch
import torch.nn as nn

In [4]:
# Synthetic data
torch.manual_seed(42)
batch_size = 3
seq_len = 4
d_model = 8
num_heads = 2

q = torch.rand(batch_size, seq_len, d_model)
k = torch.rand(batch_size, seq_len, d_model)
v = torch.rand(batch_size, seq_len, d_model)
print(q.shape)

# Force CPU for this exercise; use the commented line instead to pick up a GPU when available
# device = "cuda" if torch.cuda.is_available() else "cpu"
device = "cpu"

torch.Size([3, 4, 8])


In [5]:
class SinusoidalPositionalEmbedding(nn.Module):
    def __init__(self, max_seq_len: int, d_model: int):
        """
        Initializes the sinusoidal positional embedding.
        
        Args:
            max_seq_len (int): Maximum sequence length.
            d_model (int): Embedding dimension.
        """
        super().__init__()
        
        # Create a matrix of shape (max_seq_len, d_model)
        pe = torch.zeros(max_seq_len, d_model)
        
        # Position indices (0, 1, 2, ..., max_seq_len-1)
        position = torch.arange(0, max_seq_len, dtype=torch.float).unsqueeze(1)
        
        # Compute the div_term using the exponential decay formula
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        
        # Apply sin to even indices and cos to odd indices
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        
        # Register as buffer (not a parameter, but saved in the model)
        self.register_buffer("pe", pe)

    def forward(self, x):
        """
        Returns the positional embedding for a given input tensor.
        
        Args:
            x (Tensor): Input tensor of shape (batch_size, seq_len, ...).
                Only the sequence length x.shape[1] is used.
        
        Returns:
            Tensor: Positional embeddings of shape (1, seq_len, d_model),
                broadcastable over the batch dimension.
        """
        return self.pe[:x.shape[1], :].unsqueeze(0)  # Shape: (1, seq_len, d_model)
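
A few sanity checks on the table itself can catch indexing mistakes early. The sketch below is one possible way to do that. The helper `sinusoidal_table` is a hypothetical stand-in that repeats the construction from `__init__` so the snippet runs on its own, and it assumes an even `d_model`.

```python
import math

import torch


def sinusoidal_table(max_seq_len: int, d_model: int) -> torch.Tensor:
    """Hypothetical helper mirroring the table built in __init__ (even d_model assumed)."""
    position = torch.arange(0, max_seq_len, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(max_seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe


pe = sinusoidal_table(100, 64)

# Position 0: sin(0) = 0 in the even columns, cos(0) = 1 in the odd columns.
assert torch.all(pe[0, 0::2] == 0)
assert torch.all(pe[0, 1::2] == 1)

# Every entry is a sine or cosine, so it lies in [-1, 1] (small float tolerance).
assert pe.abs().max() <= 1.0 + 1e-6

# Each (sin, cos) pair satisfies sin^2 + cos^2 = 1, so every row has
# squared norm d_model / 2 = 32.
row_sq_norm = (pe ** 2).sum(dim=1)
assert torch.allclose(row_sq_norm, torch.full((100,), 32.0))

# Each position gets a distinct vector.
assert torch.unique(pe, dim=0).shape[0] == 100
```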

In [6]:
# from fairseq.modules.sinusoidal_positional_embedding import SinusoidalPositionalEmbedding

max_seq_len = 100
d_model = 64

# Fairseq's implementation requires the number of embeddings (seq length) and embedding dim
# pos_emb = SinusoidalPositionalEmbedding(d_model, max_seq_len, padding_idx=None)

# Generate embeddings for a sequence of length 50
seq_len = 50
positions = torch.arange(seq_len).unsqueeze(0)  # Shape: (1, seq_len)
# positional_encoding = pos_emb(positions)  # Shape: (1, seq_len, d_model)

# The constructor takes (max_seq_len, d_model), in that order
custom_pos_emb = SinusoidalPositionalEmbedding(max_seq_len, d_model)

positional_encoding_custom = custom_pos_emb(positions)

print(positional_encoding_custom.shape)  # (1, 50, 64)


torch.Size([1, 50, 64])
