### Overview

`nn.Embedding` in PyTorch is a module that serves as a trainable lookup table. It converts discrete inputs (integer indices representing words, tokens, or categories) into dense, continuous vectors of fixed size, known as embeddings.

It is commonly used in natural language processing (NLP) tasks, where words or tokens are represented as integer indices that need to be mapped into a continuous vector space.

**Lookup Table:**

It functions as a matrix in which each row is the embedding vector for one index. Given an input index, `nn.Embedding` returns the corresponding row (the embedding vector) of this matrix.
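The lookup-table behaviour can be checked directly. The following is a minimal sketch; the sizes (6 rows, 4 dimensions), the seed, and the variable names are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A small table: 6 rows (indices 0..5), each a 4-dimensional vector
table = nn.Embedding(num_embeddings=6, embedding_dim=4)

idx = torch.tensor([2, 5])

# Calling the module looks up rows of the weight matrix...
looked_up = table(idx)

# ...which gives the same values as indexing the weight matrix directly
indexed = table.weight[idx]

print(torch.equal(looked_up, indexed))  # True
```
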


**Purpose**

The `Embedding` layer is designed to transform discrete tokens (like words) into continuous vectors. A one-hot encoded vector is `sparse` and `high-dimensional`: it has one entry per vocabulary item, and all but one of them are zero. An embedding layer instead represents each word as a `low-dimensional`, `dense vector`. This helps the model learn relationships between different words, as similar words can end up with similar vector representations.
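One way to relate the two representations: multiplying a one-hot vector by the embedding matrix selects a single row, which is exactly what the lookup returns without ever building the sparse vector. A small sketch, with an assumed vocabulary size of 8 and embedding size of 3:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size, dim = 8, 3  # illustrative sizes
emb = nn.Embedding(vocab_size, dim)
token_ids = torch.tensor([1, 4, 7])

# Sparse route: one-hot vectors of length vocab_size, multiplied by the table
one_hot = F.one_hot(token_ids, num_classes=vocab_size).float()  # shape (3, 8)
via_matmul = one_hot @ emb.weight                               # shape (3, 3)

# Dense route: direct row lookup
via_lookup = emb(token_ids)

print(torch.allclose(via_matmul, via_lookup))  # True
```

The lookup avoids materializing a `vocab_size`-wide vector per token, which matters when the vocabulary has tens of thousands of entries.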


**Key Features**

- `Input:` The input to the `Embedding` layer is usually a tensor of indices, where each index corresponds to a specific word or token.

- `Embedding Matrix:` Inside the embedding layer, PyTorch maintains a matrix where each row is the vector representation of one token. This matrix is initialized randomly (by default from a standard normal distribution) or from pretrained embeddings, and is updated during training.

- `Output:` The output of the embedding layer is a tensor where each index in the input has been replaced by its corresponding vector from the embedding matrix.



**Parameters:**

- `num_embeddings:` The size of the dictionary of embeddings, i.e., the total number of unique items (e.g., words in a vocabulary) that can be embedded.

- `embedding_dim:` The size of each embedding vector, i.e., the dimensionality of the continuous representation for each item.
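The two parameters fix the shape of the underlying weight matrix and the range of valid indices. A brief sketch follows. The error check assumes the layer runs on CPU, and the exception types caught are the ones I would expect there rather than a documented guarantee.

```python
import torch
import torch.nn as nn

layer = nn.Embedding(num_embeddings=10, embedding_dim=3)

# One row per index, one column per embedding dimension
print(layer.weight.shape)  # torch.Size([10, 3])

# Valid indices run from 0 to num_embeddings - 1; index 10 is out of range
try:
    layer(torch.tensor([10]))
    out_of_range_failed = False
except (IndexError, RuntimeError):
    out_of_range_failed = True

print(out_of_range_failed)
```
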



**Trainable:**

The embedding vectors within the `nn.Embedding` layer are typically initialized randomly and are updated during the training process through backpropagation, allowing the model to learn meaningful representations based on the task at hand (e.g., natural language processing tasks like sentiment analysis or machine translation).
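To make the update concrete, here is a minimal sketch of a single optimization step. The toy loss (the sum of squared embedding values) and the plain SGD optimizer are assumptions chosen only for illustration. With this setup, only the rows that were actually looked up change.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

emb = nn.Embedding(num_embeddings=5, embedding_dim=2)
optimizer = torch.optim.SGD(emb.parameters(), lr=0.1)

# Snapshot of the table before the update
before = emb.weight.detach().clone()

# Toy objective (illustrative only): shrink the vectors of indices 1 and 3
idx = torch.tensor([1, 3])
loss = emb(idx).pow(2).sum()

optimizer.zero_grad()
loss.backward()
optimizer.step()

# True for rows whose values changed during the step
changed = (emb.weight.detach() != before).any(dim=1)
print(changed)  # only rows 1 and 3 are marked True
```
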


In [18]:
import torch
import torch.nn as nn

torch.manual_seed(123)

# Define an Embedding layer
embedding_layer = nn.Embedding(num_embeddings=10, embedding_dim=3)

# Example input (batch of token indices)
input_indices = torch.tensor([1, 2, 3, 4])

# Forward pass through the embedding layer
output = embedding_layer(input_indices)
print(output)

tensor([[-0.5880,  0.3486,  0.6603],
        [-0.2196, -0.3792,  0.7671],
        [-1.1925,  0.6984, -1.4097],
        [ 0.1794,  1.8951,  1.3689]], grad_fn=<EmbeddingBackward0>)


In [14]:
print(output.shape)

torch.Size([4, 3])


**In the example above:**

- `num_embeddings` is the size of the vocabulary (how many distinct tokens/words you have).

- `embedding_dim` is the size of the vector space each token is embedded into.

- `input_indices` are indices of tokens, and the output will be a tensor containing their corresponding embeddings.



**Key Benefits**

- `Dense representations:` The embedding vectors are dense and typically much lower in dimensionality than one-hot encodings.

- `Learning semantic relationships:` During training, the model learns meaningful relationships between tokens. For example, in NLP, similar words tend to have similar embeddings.
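Similarity between two embeddings is often measured with cosine similarity. The sketch below shows only the mechanics. The layer is untrained, so the value for two different tokens is random and carries no meaning yet. The indices `1` and `2` are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
emb = nn.Embedding(num_embeddings=10, embedding_dim=3)

# No gradients are needed just to inspect the vectors
with torch.no_grad():
    a = emb(torch.tensor([1]))  # shape (1, 3)
    b = emb(torch.tensor([2]))  # shape (1, 3)

sim_ab = F.cosine_similarity(a, b).item()  # somewhere in [-1, 1]
sim_aa = F.cosine_similarity(a, a).item()  # a vector compared with itself

print(sim_ab, sim_aa)
```

After training on a suitable task, tokens used in similar contexts would be expected to score closer to 1 than unrelated ones.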



**Applications**

`nn.Embedding` is widely used in `Natural Language Processing (NLP)` to represent words, subwords, or other categorical features as `dense vectors`, enabling neural networks to process text data more effectively than with `sparse representations` like `one-hot encodings`. It can also be used for other types of categorical data in various machine learning applications.


- `Word Embeddings in NLP:` Words can be embedded into a continuous space, allowing the model to capture relationships between words. The embedding matrix can also be initialized from pretrained vectors such as `Word2Vec` or `GloVe`.

- `Categorical Data:` Embeddings can also be applied to other forms of discrete categorical data in different domains.
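For the pretrained case, PyTorch provides `nn.Embedding.from_pretrained`. The sketch below uses a made-up 3×3 tensor as a stand-in for real vectors. Loading an actual `Word2Vec` or `GloVe` file is outside its scope.

```python
import torch
import torch.nn as nn

# Stand-in for vectors loaded from a pretrained file (values are made up)
pretrained = torch.tensor([[0.1, 0.2, 0.3],
                           [0.4, 0.5, 0.6],
                           [0.7, 0.8, 0.9]])

# freeze=True (the default) keeps the vectors fixed during training
emb = nn.Embedding.from_pretrained(pretrained, freeze=True)

print(emb(torch.tensor([2])))    # tensor([[0.7000, 0.8000, 0.9000]])
print(emb.weight.requires_grad)  # False
```

Passing `freeze=False` instead would let the pretrained vectors be fine-tuned along with the rest of the model.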


**Understanding the Output**

**1. Shape of the Output:** The output tensor is of shape `(4, 3)`. This means there are `4 rows`, each representing the embedding of one input index, and each embedding is a `3-dimensional vector`.

- The number of rows corresponds to the number of input indices (in this case, `[1, 2, 3, 4]`, i.e., 4 indices).

- The number of columns (3) corresponds to the `embedding_dim`, which defines the size of each embedding vector.



**2. Values in the Output:** The values in the output tensor are the `embedding vectors` assigned to each input index. Because the layer has not been trained yet, these are still the random initial values.

  - First Row: `[-0.5880,  0.3486,  0.6603]` is the embedding vector for the token with index `1`.
  
  - Second Row: `[-0.2196, -0.3792,  0.7671]` is the embedding vector for the token with index `2`.

  - Third Row: `[-1.1925,  0.6984, -1.4097]` is the embedding vector for the token with index `3`.

  - Fourth Row: `[ 0.1794,  1.8951,  1.3689]` is the embedding vector for the token with index `4`.


These vectors are initially random, and during the training process, the model will update these vectors to capture meaningful information about the relationships between the tokens.

`grad_fn=<EmbeddingBackward0>:` This part indicates that the output was produced by a differentiable operation that autograd is tracking. Gradients can therefore flow back through the lookup to the embedding matrix during `backpropagation`, so the rows that were looked up can be updated by the optimizer during training.
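The gradient flow can be inspected on a tiny layer. In this sketch (sizes and indices are arbitrary), summing the output and calling `backward()` leaves a gradient of ones for each time a row was looked up, so a repeated index accumulates its gradient and unused rows stay at zero.

```python
import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=4, embedding_dim=2)

# Index 1 appears twice, index 2 once, indices 0 and 3 not at all
out = emb(torch.tensor([1, 1, 2]))
print(out.grad_fn is not None)  # True: the lookup is tracked by autograd

# d(sum)/d(weight[i]) counts how often row i was looked up
out.sum().backward()
print(emb.weight.grad)
# tensor([[0., 0.],
#         [2., 2.],
#         [1., 1.],
#         [0., 0.]])
```
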


**To summarize:**

- The output contains the embeddings (dense vectors) for the input indices `[1, 2, 3, 4]`.

- Each index is mapped to a unique 3-dimensional vector (since `embedding_dim=3`).

- The embeddings are currently random but will be learned during training as the model optimizes its weights.

In [16]:
# Define an embedding layer for a vocabulary of 10 words, with each embedding being 5-dimensional
embedding_layer = nn.Embedding(num_embeddings=10, embedding_dim=5)

# Input a tensor of word indices
input_indices = torch.tensor([[0, 2, 5, 4],
                              [6, 4, 2, 3],
                              [8, 2, 5, 9]])

# Get the embeddings for the input indices
embeddings = embedding_layer(input_indices)
print(embeddings)

tensor([[[ 0.3178, -1.4041, -1.6237, -3.1146, -1.2449],
         [ 0.8456, -1.3980,  1.0862, -0.8557,  0.7466],
         [-0.1966,  0.2221,  1.7297, -0.1098,  1.2997],
         [ 0.4546,  1.4348, -1.8808,  1.0109, -0.3142]],

        [[ 3.3002,  1.0113, -0.3251, -0.8544,  0.7233],
         [ 0.4546,  1.4348, -1.8808,  1.0109, -0.3142],
         [ 0.8456, -1.3980,  1.0862, -0.8557,  0.7466],
         [ 1.3938,  3.2831,  0.7804, -1.8471, -0.3983]],

        [[ 1.5156,  1.9463,  0.7986, -0.8951,  0.0356],
         [ 0.8456, -1.3980,  1.0862, -0.8557,  0.7466],
         [-0.1966,  0.2221,  1.7297, -0.1098,  1.2997],
         [ 0.4820, -0.7725,  0.1360,  0.3886, -0.5229]]],
       grad_fn=<EmbeddingBackward0>)


In [17]:
print(embeddings.shape)

torch.Size([3, 4, 5])


The output shape `torch.Size([3, 4, 5])` from an embedding layer means:

- **3**: Batch size (number of sequences)
- **4**: Sequence length (number of tokens per sequence)  
- **5**: Embedding dimension (size of each token's vector representation)

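Each token in the input is replaced with a 5-dimensional dense vector. Because the lookup depends only on the index, repeated indices yield identical rows: index `2` appears in all three sequences and maps to `[ 0.8456, -1.3980,  1.0862, -0.8557,  0.7466]` each time.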
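The shape breakdown above generalizes: the output shape is the input shape with `embedding_dim` appended. The following sketch reuses the same index tensor (the variable names `batch` and `out` are my own) and also checks that the three occurrences of index `2` return the same row.

```python
import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=10, embedding_dim=5)

batch = torch.tensor([[0, 2, 5, 4],
                      [6, 4, 2, 3],
                      [8, 2, 5, 9]])
out = emb(batch)

# Output shape = input shape + (embedding_dim,)
expected_shape = (*batch.shape, emb.embedding_dim)
print(out.shape == expected_shape)  # True

# Index 2 sits at positions [0, 1], [1, 2] and [2, 1]; all three rows match
same_rows = torch.equal(out[0, 1], out[1, 2]) and torch.equal(out[0, 1], out[2, 1])
print(same_rows)  # True
```
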