# Sinusoidal Positional Embeddings
[> Sinusoidal positional encoding code](./sinusodial_PE.py)

### 1. Why does the embedding dimension have to be even?
- The sinusoidal encoding fills the embedding in (sine, cosine) pairs that share one frequency: for each position, even indices use sine and odd indices use cosine.
- If the embedding dimension were odd, the last index would hold a sine with no matching cosine, so we wouldn't have complete pairs.
```python
if emb_dim % 2 != 0:
    raise ValueError(f"emb_dim must be even, but got {emb_dim}")
```

### 2. What is dropout? Why do we need dropout here?
- Dropout is a regularization technique that randomly sets a fraction of input units to 0 during training to reduce overfitting.
- Here it is applied to the sum of the input embeddings and the positional encodings, which helps the model generalize better by preventing `co-adaptation of features`.

### 2.1 What is co-adaptation of features?
Co-adaptation occurs when certain neurons in a neural network rely too heavily on specific other neurons, making the network less robust and more prone to overfitting.
> For example, if neuron A always activates with neuron B, the network might not learn meaningful patterns independently.

**Why it's problematic:**
- Reduces model generalization
- Makes the network sensitive to specific training data patterns
- Can lead to overfitting

**How dropout helps:**
- Randomly deactivates (sets to zero) a fraction of the units on each training step
- Prevents neurons from relying too much on specific other activations, which discourages co-adaptation
- Encourages the network to learn more robust features
- Acts like an ensemble method, since each training step updates a different subnetwork
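
The behavior described above can be checked on a toy tensor. This is a minimal sketch using `nn.Dropout` directly; the tensor size and `p=0.5` are illustrative choices, not values taken from the positional encoding code.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)   # illustrative drop probability
x = torch.ones(1000)

drop.train()
y_train = drop(x)   # entries are either 0 or scaled by 1 / (1 - p) = 2.0

drop.eval()
y_eval = drop(x)    # in eval mode dropout is the identity

print(y_train.unique())          # the distinct values seen in training mode
print(torch.equal(y_eval, x))    # whether eval mode left the input unchanged
```

The rescaling of the surviving entries keeps the expected activation the same in training and evaluation, which is why nothing needs to change at inference time beyond calling `eval()`.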



### 3. Why do we build `position` with `arange` and then `unsqueeze`?
```python
# create a 1D tensor of positions [0, 1, ..., seq_len-1]
torch.arange(seq_len)
# unsqueeze to make it a 2D tensor of shape [seq_len, 1]
# like [[0], [1], ..., [seq_len-1]]
torch.arange(seq_len).unsqueeze(1)
```
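- This enables broadcasting when multiplying with `div_term`, which has shape `[emb_dim//2]`: `[seq_len, 1] * [emb_dim//2]` broadcasts to `[seq_len, emb_dim//2]`, giving one angle per (position, frequency) pair.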
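
A small shape check makes the broadcasting concrete. This is a sketch under assumptions: the `div_term` formula below is the common `10000`-base version and may differ from the one in `sinusodial_PE.py`, and the sizes are arbitrary.

```python
import math
import torch

seq_len, emb_dim = 6, 8  # illustrative sizes

# [seq_len, 1]: one row per position
position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)

# [emb_dim // 2]: one frequency per (sin, cos) pair (assumed formula)
div_term = torch.exp(
    torch.arange(0, emb_dim, 2, dtype=torch.float32) * (-math.log(10000.0) / emb_dim)
)

# Broadcasting: [seq_len, 1] * [emb_dim // 2] -> [seq_len, emb_dim // 2]
angles = position * div_term

print(position.shape, div_term.shape, angles.shape)
```

Without the `unsqueeze(1)`, the product of a `[seq_len]` tensor and a `[emb_dim//2]` tensor would either fail or multiply element by element, instead of producing the full position-by-frequency grid.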

### 4. Why do we use `register_buffer`?
```python
self.register_buffer('pe', pe.unsqueeze(0))
```
- `register_buffer` registers a tensor (here `pe`) that is not a parameter of the module but should still be part of the module's state. A buffer is:
    - Moved to the same device as the module's parameters
    - Saved in the `state_dict` for saving and loading
    - Not considered a trainable parameter (unlike `nn.Parameter`)
- The positional encodings are constant and don't need gradients, so they're stored as a buffer.
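
The difference between a buffer and a parameter can be inspected directly. The module below is a hypothetical toy (the name `WithBuffer` and its tensors are made up for illustration), not the class from the linked file.

```python
import torch
import torch.nn as nn

class WithBuffer(nn.Module):
    """Toy module holding one trainable parameter and one buffer."""

    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.zeros(3))        # trainable
        self.register_buffer("pe", torch.ones(1, 4, 3))   # state, not trainable

m = WithBuffer()
param_names = [name for name, _ in m.named_parameters()]
buffer_names = [name for name, _ in m.named_buffers()]
state_keys = list(m.state_dict().keys())

print(param_names)    # names the optimizer would see
print(buffer_names)   # names registered as buffers
print(state_keys)     # names saved and loaded with the model
```

The buffer appears in the `state_dict` but not among the parameters, so an optimizer built from `m.parameters()` never updates it.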


### 5. What does `forward` do here?
1. Takes input `x` of shape (batch_size, seq_len, emb_dim)
2. Slices the positional encoding to match the input sequence length.
3. Adds the sliced positional encoding to the input tensor.
4. Applies dropout to the result
5. Returns the position-aware embeddings
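
Put together, the steps above might look like the following. This is a minimal sketch, not the contents of `sinusodial_PE.py`: the class name `SinusoidalPESketch`, the `max_len` argument, and the `10000` base are assumptions made for illustration.

```python
import math
import torch
import torch.nn as nn

class SinusoidalPESketch(nn.Module):
    """Hypothetical sketch of a sinusoidal positional encoding module."""

    def __init__(self, emb_dim: int, max_len: int = 512, dropout: float = 0.1):
        super().__init__()
        if emb_dim % 2 != 0:
            raise ValueError(f"emb_dim must be even, but got {emb_dim}")
        self.dropout = nn.Dropout(dropout)

        position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, emb_dim, 2, dtype=torch.float32)
            * (-math.log(10000.0) / emb_dim)
        )
        pe = torch.zeros(max_len, emb_dim)
        pe[:, 0::2] = torch.sin(position * div_term)  # even indices: sine
        pe[:, 1::2] = torch.cos(position * div_term)  # odd indices: cosine
        self.register_buffer("pe", pe.unsqueeze(0))   # [1, max_len, emb_dim]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        seq_len = x.size(1)
        x = x + self.pe[:, :seq_len]   # slice to the input length, then add
        return self.dropout(x)

# Usage: dropout disabled so the output is deterministic
layer = SinusoidalPESketch(emb_dim=16, max_len=32, dropout=0.0).eval()
x = torch.zeros(2, 10, 16)
out = layer(x)
print(out.shape)
```

Because the input here is all zeros and dropout is off, each output row is just the positional encoding for that position, which makes the slicing step easy to verify.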

### Extra: Why does the class inherit from `nn.Module`?
`nn.Module` is the base class for all neural network modules in PyTorch. It provides:

1. Parameter Management
    - Tracks all `nn.Parameter` objects
    - Enables automatic differentiation
    - Handles moving parameters to GPU/CPU

2. State Management
    - Maintains model state (train/eval modes)
    - Handles saving/loading model state (model persistence)
    - Manages buffers (`register_buffer`)
3. Module Composition
    - Enables building complex architectures
    - Supports nested modules
    - Provides `to()`, `train()`, `eval()`, `parameters()`, `state_dict()`, `load_state_dict()` methods
4. Forward Hooks
    - Allow inserting custom operations before or after a module's forward pass
    - Useful for debugging, visualization, and custom training
5. Integration with PyTorch's autograd engine
    - Registered parameters have `requires_grad=True` by default, so gradients are computed for them during `backward()`
    - `parameters()` hands exactly those tensors to the optimizer
    - Custom backward passes are written with `torch.autograd.Function`, not with `nn.Module` itself
In our case, `nn.Module` helps us to:
1. Register the positional encoding buffer
2. Use the module in a PyTorch model
3. Move all tensors to the correct device (CPU/GPU) automatically
4. Save/load the model's state, including the positional encodings

# Learned embeddings 
[> Learned Positional encoding code](./learned_PE.py)

### 1. Why do we use an embedding layer, and not a linear layer or a raw parameter?
- An embedding layer (`nn.Embedding`) is a lookup table where the key is the position index and the value is a learned vector of size `emb_dim`.
    - We use it for position encodings, word embeddings, or any mapping from discrete indices to vectors.
- A linear layer (`nn.Linear`) performs a linear transformation of the input tensor.
    - It multiplies the input by a learned weight matrix and optionally adds a bias.
    - We use it for feature transformations and projections; it applies no non-linearity by itself.
- A raw parameter (`nn.Parameter`) is a tensor that is registered as trainable and learned during training.
    - We use it when we need direct access to learnable parameters with custom update logic
    - We use it when we need to share parameters across different parts of the model
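
The three options compute the same thing for a position lookup; they differ in how the row is selected. The sketch below compares an `nn.Embedding` lookup, a one-hot matrix multiplication (what a bias-free linear layer would do), and plain indexing into the weight. Sizes and indices are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
max_len, emb_dim = 8, 4            # illustrative sizes
emb = nn.Embedding(max_len, emb_dim)
positions = torch.tensor([0, 3, 5])

# 1) Embedding lookup: selects rows by index
looked_up = emb(positions)                                     # [3, emb_dim]

# 2) Linear-style lookup: one-hot input times the same weight matrix
one_hot = F.one_hot(positions, num_classes=max_len).float()   # [3, max_len]
via_matmul = one_hot @ emb.weight                              # [3, emb_dim]

# 3) Raw-parameter style: index the weight tensor directly
via_indexing = emb.weight[positions]                           # [3, emb_dim]

print(torch.allclose(looked_up, via_matmul))
print(torch.allclose(looked_up, via_indexing))
```

The matrix multiplication touches every row of the weight to pick one, which is the extra work the embedding layer avoids.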

# Positional Embeddings: Implementation and Comparison

## 1. Core Implementation Choices

### 1.1 Embedding Layer
- **Lookup Efficiency**: O(1) complexity for direct position vector access
- **Memory Layout**: Optimized for sparse lookups
- **Gradient Flow**: Clean, direct gradient flow to position vectors
- **Parameter Storage**: Stores exactly one vector per position

### 1.2 Why Not Linear Layer?
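- **Complexity**: Used as a lookup, it needs a one-hot input and a matrix multiplication costing O(seq_len × emb_dim) per position, versus an O(emb_dim) row copy for an index lookup
- **Memory**: The weight matrix has the same [seq_len, emb_dim] size as the embedding table, and the one-hot inputs add extra memory on top
- **Inefficiency**: Multiplies by every row of the weight matrix even though only one row is selected
- **Gradient Flow**: Unnecessary computation through the full weight matrix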

### 1.3 Why Not Parameter Layer?
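- **Memory**: Same storage as an embedding table (`nn.Embedding` itself keeps its weight as an `nn.Parameter`)
- **Optimization**: No built-in lookup options such as `sparse=True` gradients or `padding_idx`
- **Implementation**: Indexing and range checks have to be written by hand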
- **Flexibility**: Harder to extend with relative positions

## 2. Key Features

### 2.1 Scaling Factor
- **Purpose**: Controls magnitude of positional embeddings
- **Benefit**: Prevents dominance over input embeddings
- **Implementation**: Simple multiplicative scaling
- **Default**: 1.0 (no scaling)

### 2.2 Layer Normalization
- **Purpose**: Stabilizes training
- **Benefit**: Normalizes combined embeddings
- **Placement**: After position addition
- **Impact**: Improves gradient flow

### 2.3 Relative Positions
- **Purpose**: Captures position relationships
- **Benefit**: Better for tasks where relative positions matter
- **Implementation**: Additional learnable bias terms
- **Use Case**: When sequence order matters more than absolute position

## 3. Performance Considerations

### 3.1 Memory Efficiency
- Embedding: One vector per position, selected by index
- Linear: Same weight size, plus one-hot inputs (and a bias unless disabled)
- Parameter: Same storage as embedding, with manual indexing

### 3.2 Training Dynamics
- **Sparse Updates**: Only updates used positions
- **Gradient Flow**: Direct paths to position vectors
- **Convergence**: Typically faster with embedding layers
- **Stability**: Improved with proper scaling and normalization

## 4. Best Practices

1. **Prefer** embedding layers for learned positional embeddings
2. **Consider** adding layer normalization for deeper networks
3. **Use** relative positions for tasks where sequence relationships matter
4. **Experiment** with scaling factors (start with 1.0)
5. **Monitor** gradient norms to ensure stable training

## 5. Example Usage

```python
# Basic usage
pe = LearnedPositionalEmbedding(emb_dim=512)

# With all features
pe = LearnedPositionalEmbedding(
    emb_dim=512,
    dropout=0.1,
    seq_len=1024,
    scale=0.5,
    use_ln=True,
    relative=True
)
```

### 2. Why do we register `positions` as a buffer?
- Registered buffers are saved and loaded with the model's state dictionary, which `nn.Module` maintains for us
- Ensures consistent behavior when saving/loading the model
- Automatically moves to the same device as the module's parameters, so no manual `.to(device)` calls are needed
- Positions are fixed indices, not learnable parameters
- More memory efficient than parameters: a buffer stores no gradient, while a parameter stores its gradient as well
- Created once in `__init__()`, which avoids recreating the position tensor on every forward pass
- Ensures consistent behavior across different runs

Without it:
1. Inefficiency: we would create a new tensor on every forward pass
2. Device mismatch: could lead to "tensors on different devices" errors
3. State issues: positions wouldn't be saved with the model


### 3. How do we use the positional embedding in the forward pass?
```python
pos_embeddings = self.position_embedding(self.positions[:seq_len])

x = x + pos_embeddings
```
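
For context, the two lines above could sit in a module like the one below. This is a hedged sketch, not the contents of `learned_PE.py`: the class name `LearnedPESketch` is hypothetical, the argument names mirror the usage example earlier in these notes, and the `relative` option is left out.

```python
import torch
import torch.nn as nn

class LearnedPESketch(nn.Module):
    """Hypothetical sketch of a learned absolute positional embedding."""

    def __init__(self, emb_dim, seq_len=1024, dropout=0.1, scale=1.0, use_ln=False):
        super().__init__()
        self.position_embedding = nn.Embedding(seq_len, emb_dim)
        # Fixed indices 0..seq_len-1, stored as a buffer (not trainable)
        self.register_buffer("positions", torch.arange(seq_len))
        self.scale = scale
        self.norm = nn.LayerNorm(emb_dim) if use_ln else nn.Identity()
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        seq_len = x.size(1)
        pos_embeddings = self.position_embedding(self.positions[:seq_len])
        x = x + self.scale * pos_embeddings   # [seq_len, emb_dim] broadcasts over batch
        x = self.norm(x)                      # applied after the position addition
        return self.dropout(x)

# Usage: dropout disabled so the output is deterministic
layer = LearnedPESketch(emb_dim=16, seq_len=32, dropout=0.0)
x = torch.zeros(2, 10, 16)
out = layer(x)
print(out.shape)
```

With a zero input, no scaling, and no layer normalization, the output equals the first `seq_len` rows of the embedding table, which is a quick way to confirm the slicing.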


### 4. Why do we apply dropout before returning?

# RoPE
[> RoPE code](./RoPE.py)


### 1. What is the inverse frequency? Why is it needed?

### 2. What is done in practice instead of precomputing the positions with `arange`?

### 3. What is an outer product?

### 4. What is `einsum`? What does this line do?
```python
freqs = torch.einsum("i,j->ij", t, inv_freq)
```
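
One way to read the `einsum` line is to compare it with `torch.outer` on small inputs. In this sketch the definitions of `t` and `inv_freq` are assumptions (the common base-`10000` form); the ones in `RoPE.py` may differ.

```python
import torch

dim = 8                                              # illustrative head dimension
t = torch.arange(4, dtype=torch.float32)             # positions 0..3 (assumed)
inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))

# "i,j->ij": freqs[i, j] = t[i] * inv_freq[j], with no summation
freqs = torch.einsum("i,j->ij", t, inv_freq)         # [4, dim // 2]

# The same result written as an outer product
same = torch.outer(t, inv_freq)

print(freqs.shape)
print(torch.allclose(freqs, same))
```

Each row of `freqs` holds the rotation angles for one position, one angle per feature pair.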


### 5. Why do we concatenate the frequencies to handle both sin and cos? What is the practice followed here?
```python
emb = torch.cat((freqs, freqs), dim=-1)
```


### 6. Why do we register buffers for cos and sin?

### 7. What is the significance of `x_pairs` in the code?
```python
x_pairs = x.float().reshape(*x.shape[:-1], -1, 2)
```


### 8. Why do we take the even- and odd-indexed features as the real and imaginary parts, respectively?
```python
x1 = x_pairs[..., 0]  # Real part
x2 = x_pairs[..., 1]  # Imaginary part
```
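
The pairing can be tried on a tiny tensor. The reshape and the `x1`/`x2` split follow the snippets above; the rotation step is the standard 2D rotation and is an assumption about how `RoPE.py` combines the parts, and the angles are arbitrary.

```python
import torch

x = torch.arange(8, dtype=torch.float32).reshape(1, 8)   # one token, 8 features
x_pairs = x.float().reshape(*x.shape[:-1], -1, 2)         # [1, 4, 2]
x1 = x_pairs[..., 0]   # even-indexed features: 0, 2, 4, 6
x2 = x_pairs[..., 1]   # odd-indexed features:  1, 3, 5, 7

theta = torch.tensor([0.1, 0.2, 0.3, 0.4])                # arbitrary angle per pair
cos, sin = torch.cos(theta), torch.sin(theta)

# Rotate each (x1, x2) pair by its angle, like multiplying a complex number
r1 = x1 * cos - x2 * sin
r2 = x1 * sin + x2 * cos
rotated = torch.stack((r1, r2), dim=-1).flatten(-2)       # back to [1, 8]

print(x_pairs.shape, rotated.shape)
```

A rotation leaves the length of each pair unchanged, so only the direction of each pair carries the position information.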


### 9. Why element-wise multiplication instead of matrix multiplication?

### 10. Why do we return both query and key?

# ALiBi
[> ALiBi code](./Alibi_PE.py)


### 1. Why do we need slopes and heads, unlike the other methods?

### 2. How did we implement the concepts from the paper, line by line?

### 3. What is `alibi_bias`, and what is its significance?

### 4. What is the difference between a model's state and a trainable parameter?

### 5. How do we calculate the slopes?

### 6. What is the significance of adding the ALiBi bias to the attention scores?

### 7. Why do we slice the ALiBi bias?

### Extra: Practices for implementing the equation in sequence

# Comparison of Sinusoidal, Learned Embeddings, RoPE, ALiBi
| Property | Sinusoidal | Learned Embeddings | RoPE | ALiBi |
|----------|------------|-------------------|------|-------|
| **Type** | Deterministic | Learned | Deterministic (fixed rotations) | Deterministic (fixed per-head slopes) |
| **Training** | Fixed, not trainable | Trainable | Not trainable (fixed rotations applied to learned queries/keys) | Not trainable in the original formulation (slopes are fixed) |
| **Sequence Length** | Computable for any length, but quality degrades beyond training lengths | Fixed maximum length | Flexible, better generalization | Excellent extrapolation to longer sequences |
| **Position Info** | Absolute positions | Absolute positions | Relative positions (via rotated queries/keys) | Relative positions (distance-based bias) |
| **Computation** | Additive (to embeddings) | Additive (to embeddings) | Multiplicative (rotations) | Additive (bias on attention scores) |
| **Memory** | Low | Medium | Medium | Low |
| **Performance** | Good for short sequences | Better than sinusoidal with careful init | Excellent for relative positions | Best for long sequences |
| **Training Speed** | Fastest | Slower (learned params) | Moderate | Fast (efficient bias) |
| **Use Cases** | Early Transformers | Early BERT variants | LLaMA, GPT-NeoX | BLOOM, BLOOMZ |
| **Extrapolation** | Poor | Poor | Good | Excellent |
| **Implementation** | Simple | Simple | Complex | Moderate |
| **Attention Pattern** | Global | Global | Local + Global | Local + Global |
| **Gradient Flow** | Stable | Can be unstable | Stable | Very stable |
| **Popular Models** | Original Transformer | BERT, GPT-2 | LLaMA, GPT-J | BLOOM, BLOOMZ |
| **Relative Position** | No | No | Yes | Yes |
| **Train/Test Length** | Can differ, but degrades on longer inputs | Test length cannot exceed the trained maximum | Can differ | Can differ significantly |

