## Embedding

## Definition: Embedding
An embedding layer maps each discrete token ID to a learnable vector, providing the continuous representation that neural networks need. During training, only the embedding rows for token IDs that appear in the current batch receive nonzero gradients, so a plain SGD step leaves every other row unchanged.
$$  \text{Embedding}(i)=w_i$$

Input
* $i \in \{1, \dots, d_{\text{in}}\}$, a token ID (the code below uses 0-based IDs in $[0, d_{\text{in}})$)

Weights
* weight $W=[w_1|w_2|\dots|w_{d_{\text{in}}}]^T \in \mathbb{R}^{d_{\text{in}}\times d_{\text{emb}}}$ 

Output
* $o = \text{Embedding}(i)=w_i \in \mathbb{R}^{d_{\text{emb}}}$
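
The lookup $w_i$ can also be read as multiplying a one-hot row vector by $W$. The following is a minimal sketch of that equivalence; the sizes and the names `lookup`, `one_hot`, and `via_matmul` are illustrative assumptions, not part of the definition above.

```python
import torch
import torch.nn.functional as F

d_in, d_emb = 5, 3                     # illustrative sizes (assumed)
W = torch.randn(d_in, d_emb)           # row k is the embedding vector of ID k
i = torch.tensor([0, 2, 2, 4])         # 0-based token IDs

# Direct row lookup, as in the definition.
lookup = W[i]                                          # shape (4, d_emb)

# Same result via one-hot encoding followed by a matrix product.
one_hot = F.one_hot(i, num_classes=d_in).to(W.dtype)   # shape (4, d_in)
via_matmul = one_hot @ W                               # shape (4, d_emb)

print(torch.allclose(lookup, via_matmul))
```

Indexing avoids materializing the one-hot matrix, which is why embedding layers are usually implemented as a lookup rather than a matrix product.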

## Code: Embedding

In [51]:
import torch
import torch.nn as nn

class Embedding(nn.Module):
    def __init__(self, num_embeddings: int, embedding_dim: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_embeddings, embedding_dim))

    def forward(self, input_indices: torch.Tensor):
        """
        input_indices: (batch, seq_len) or any shape of integers in [0, num_embeddings)
        """
        return self.weight[input_indices]
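
Because `forward` is plain tensor indexing, the output shape should be the input shape with `embedding_dim` appended. The sketch below checks this for a 2-D batch, using a standalone `weight` tensor as a stand-in for `self.weight`; the sizes are assumptions chosen for illustration.

```python
import torch

weight = torch.randn(5, 3)                   # stand-in for self.weight
ids = torch.tensor([[0, 1, 2],
                    [2, 3, 4]])              # shape (batch=2, seq_len=3)

out = weight[ids]                            # same indexing as in forward()
print(out.shape)                             # torch.Size([2, 3, 3])

# Each position holds the row selected by the ID at that position.
print(torch.equal(out[1, 0], weight[2]))
```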

## Testing

### Evaluation

#### custom implementation

In [52]:
torch.manual_seed(0)
embed = Embedding(5, 3)  # num_embeddings=5, embedding_dim=3

x = torch.tensor([0,1,2,2,3,4])
embed(x)

tensor([[ 1.5410, -0.2934, -2.1788],
        [ 0.5684, -1.0845, -1.3986],
        [ 0.4033,  0.8380, -0.7193],
        [ 0.4033,  0.8380, -0.7193],
        [-0.4033, -0.5966,  0.1820],
        [-0.8567,  1.1006, -1.0712]], grad_fn=<IndexBackward0>)

#### torch implementation

In [53]:
torch.manual_seed(0)
nn_embed = nn.Embedding(5, 3)  # PyTorch's built-in layer, for comparison with the custom class
nn_embed(x)

tensor([[ 1.5410, -0.2934, -2.1788],
        [ 0.5684, -1.0845, -1.3986],
        [ 0.4033,  0.8380, -0.7193],
        [ 0.4033,  0.8380, -0.7193],
        [-0.4033, -0.5966,  0.1820],
        [-0.8567,  1.1006, -1.0712]], grad_fn=<IndexBackward0>)
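
Comparing printed tensors by eye is error-prone, so a programmatic check may be preferable. The sketch below compares the custom layer against `nn.Embedding` after copying the weights explicitly, rather than relying on both layers drawing the same random numbers from the seed. The variable names `custom` and `reference` are assumptions for this example.

```python
import torch
import torch.nn as nn

class Embedding(nn.Module):
    """Same definition as the custom layer in the code section."""
    def __init__(self, num_embeddings: int, embedding_dim: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_embeddings, embedding_dim))

    def forward(self, input_indices: torch.Tensor) -> torch.Tensor:
        return self.weight[input_indices]

custom = Embedding(5, 3)
reference = nn.Embedding(5, 3)

# Give both layers identical weights so the comparison is seed-independent.
with torch.no_grad():
    reference.weight.copy_(custom.weight)

ids = torch.tensor([0, 1, 2, 2, 3, 4])
print(torch.equal(custom(ids), reference(ids)))
```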

## Training

In [54]:
target = torch.randn(3, 3)  # same shape as output
x_train = torch.tensor([0,2,4])
loss_fn = nn.MSELoss()

#### custom implementation

In [55]:
optimizer = torch.optim.SGD(embed.parameters(), lr=0.1)
output = embed(x_train)
loss = loss_fn(output, target)
loss.backward()
optimizer.step()
optimizer.zero_grad()
embed(x)

tensor([[ 1.5095, -0.2995, -2.1221],
        [ 0.5684, -1.0845, -1.3986],
        [ 0.3746,  0.7859, -0.6950],
        [ 0.3746,  0.7859, -0.6950],
        [-0.4033, -0.5966,  0.1820],
        [-0.8053,  1.0970, -1.0302]], grad_fn=<IndexBackward0>)

#### torch implementation

In [56]:
optimizer = torch.optim.SGD(nn_embed.parameters(), lr=0.1)
output = nn_embed(x_train)
loss = loss_fn(output, target)
loss.backward()
optimizer.step()
optimizer.zero_grad()
nn_embed(x)

tensor([[ 1.5095, -0.2995, -2.1221],
        [ 0.5684, -1.0845, -1.3986],
        [ 0.3746,  0.7859, -0.6950],
        [ 0.3746,  0.7859, -0.6950],
        [-0.4033, -0.5966,  0.1820],
        [-0.8053,  1.0970, -1.0302]], grad_fn=<IndexBackward0>)
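
In the outputs above, only the rows for IDs 0, 2, and 4 change after the optimizer step. One way to see why is to inspect the weight gradient directly. The sketch below uses a sum loss (an assumption made to keep the numbers simple, not the MSE loss used above) and repeats ID 2 to show that gradients for repeated IDs accumulate.

```python
import torch
import torch.nn as nn

emb = nn.Embedding(5, 3)
ids = torch.tensor([0, 2, 2, 4])   # ID 2 appears twice; IDs 1 and 3 never appear

# With a sum loss, each selected row gets a gradient of 1 per occurrence.
emb(ids).sum().backward()
grad = emb.weight.grad

print(grad)
# Rows 1 and 3 are all zeros, rows 0 and 4 are all ones, row 2 is all twos.
```

With plain SGD and no weight decay, a zero gradient row means the corresponding embedding is left untouched by the update; optimizers with momentum or weight decay can still move such rows.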