## PyTorch/Embedding and EmbeddingBag

`nn.Embedding` is a lookup table that maps token indices to dense embedding vectors: it accepts a tensor of indices and returns one vector per index. `nn.EmbeddingBag` looks up the embeddings for a group ("bag") of indices and reduces them to a single vector using a sum, mean, or max operation. Both classes are part of the `torch.nn` module. The code examples below show how to use `Embedding` and `EmbeddingBag` in PyTorch.

`torch:` The core PyTorch library used for building and training deep learning models.                          
`torch.nn:` A module in PyTorch that contains classes for building neural network layers.                           
`tensor:` A multi-dimensional matrix containing elements of a single data type, essential for PyTorch computations.

In [6]:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
import torch

In [7]:
# Defining a data set
dataset = [
"I like cats",
"I hate dogs",
"I'm impartial to hippos"
]
#Initializing the tokenizer, iterator from the data set, and vocabulary
tokenizer = get_tokenizer('spacy', language='en_core_web_sm')
def yield_tokens(data_iter):
    for data_sample in data_iter:
        yield tokenizer(data_sample)
data_iter = iter(dataset)
vocab = build_vocab_from_iterator(yield_tokens(data_iter))
#Tokenizing each sentence and converting its tokens to vocabulary indices
input_ids=lambda x:[torch.tensor(vocab(tokenizer(data_sample))) for data_sample in x]
index=input_ids(dataset)
print(index)

[tensor([0, 7, 2]), tensor([0, 4, 3]), tensor([0, 1, 6, 8, 5])]


In [16]:
import torch
import torch.nn as nn
from torch import tensor
# import torch.nn.functional as F

#Creating the embedding layer: embedding_dim is the size of each embedding vector,
#and n_embedding is the number of unique tokens in the vocabulary
embedding_dim = 3
n_embedding = len(vocab) #9
embeds = nn.Embedding(n_embedding, embedding_dim)
#Applying the embedding object
i_like_cats=embeds(index[0])
print(f'embeddings of i_like_cats: {i_like_cats}')

i_hate_dogs=embeds(index[1])
print(f'embeddings of i_hate_dogs: {i_hate_dogs}')

impartial_to_hippos=embeds(index[-1])
print(f'embeddings of impartial_to_hippos: {impartial_to_hippos}')

embeddings of i_like_cats: tensor([[ 0.0696, -0.0102, -1.1019],
        [ 0.7790, -0.9232, -0.1057],
        [ 0.5286,  1.6652,  0.1281]], grad_fn=<EmbeddingBackward0>)
embeddings of i_hate_dogs: tensor([[ 0.0696, -0.0102, -1.1019],
        [ 0.9367, -2.2018,  0.7229],
        [ 0.4811, -1.5437, -0.5988]], grad_fn=<EmbeddingBackward0>)
embeddings of impartial_to_hippos: tensor([[ 0.0696, -0.0102, -1.1019],
        [-0.8129,  0.5123,  1.2359],
        [ 1.2870, -1.0623, -0.3928],
        [-0.5383,  0.1056,  1.2875],
        [ 0.7156,  0.2015, -0.1643]], grad_fn=<EmbeddingBackward0>)
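
One way to see what the layer is doing: an embedding lookup is equivalent to selecting rows from the layer's weight matrix. The sketch below is a minimal illustration under assumed values. The name `embeds_demo` and the seed are hypothetical, and the sizes (9 tokens, 3 dimensions) simply mirror the walkthrough above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # only so the random initial weights are repeatable

# Hypothetical layer with the same sizes as the walkthrough: 9 tokens, 3 dimensions
embeds_demo = nn.Embedding(num_embeddings=9, embedding_dim=3)

# Indices of "I like cats" as printed earlier
token_ids = torch.tensor([0, 7, 2])

looked_up = embeds_demo(token_ids)        # one row per token, shape (3, 3)
indexed = embeds_demo.weight[token_ids]   # the same rows read directly from the weight matrix

print(looked_up.shape)
print(torch.equal(looked_up, indexed))
```

This is also why the first row is identical in all three outputs above: every sentence starts with index 0, so each lookup returns row 0 of the same weight matrix.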


In [18]:
#Creating the EmbeddingBag layer with the same embedding dimension and vocabulary size;
#with the default mode ('mean'), it returns one averaged vector per bag
embedding_dim = 3
n_embedding = len(vocab) #9
embedding_bag = nn.EmbeddingBag(n_embedding, embedding_dim)

#Applying the EmbeddingBag layer; offsets=[0] treats each sentence as a single bag
print(f'embeddings of i_like_cats: {embedding_bag(index[0],offsets=torch.tensor([0]))}')

print(f'embeddings of i_hate_dogs: {embedding_bag(index[1],offsets=torch.tensor([0]))}')

print(f'embeddings of impartial_to_hippos: {embedding_bag(index[-1],offsets=torch.tensor([0]))}')

embeddings of i_like_cats: tensor([[-0.0531,  1.1954,  0.8458]], grad_fn=<EmbeddingBagBackward0>)
embeddings of i_hate_dogs: tensor([[0.6154, 0.3378, 0.1228]], grad_fn=<EmbeddingBagBackward0>)
embeddings of impartial_to_hippos: tensor([[ 0.8911, -0.0842,  0.5150]], grad_fn=<EmbeddingBagBackward0>)


Explanation of Modes in EmbeddingBag                                                                     
`embedding_bag = nn.EmbeddingBag(num_embeddings, embedding_dim, mode='mean')`

The mode parameter in nn.EmbeddingBag defines how the embeddings of the tokens in each "bag" (or sequence) are aggregated. The available options are:                                                
`'sum'`: Sums the embeddings of all tokens in the bag.  
`'mean'`: Averages the embeddings of all tokens in the bag. This is the default.  
`'max'`: Takes the maximum value for each dimension across all token embeddings in the bag.  
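
The three modes can be compared on a small, hand-checkable case. The following sketch assumes a made-up 3x2 weight matrix (not from the data set above) and loads it with `nn.EmbeddingBag.from_pretrained` so that the aggregated values are easy to verify by hand.

```python
import torch
import torch.nn as nn

# Hand-picked weights (an assumption for illustration): 3 tokens, 2 dimensions
weights = torch.tensor([[1.0, 2.0],
                        [3.0, 0.0],
                        [5.0, 4.0]])

indices = torch.tensor([0, 1, 2])  # one bag containing all three tokens
offsets = torch.tensor([0])        # the single bag starts at position 0

results = {}
for mode in ("sum", "mean", "max"):
    bag = nn.EmbeddingBag.from_pretrained(weights, mode=mode)
    results[mode] = bag(indices, offsets)
    print(mode, results[mode])

# sum  -> [[9., 6.]]  (1+3+5, 2+0+4)
# mean -> [[3., 2.]]  (9/3, 6/3)
# max  -> [[5., 4.]]  (largest value in each column)
```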

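`Why offsets are needed:`  
`EmbeddingBag` accepts all bags as one flat 1D tensor of indices, plus an `offsets` tensor that marks where each bag starts. This is especially useful for n-gram models, where we usually concatenate the words according to the model configuration. For example, in a 3-gram model we combine the first two words as the context and use the third word as the target.
Imagine that the indices of two such concatenated sequences are [0, 1] and [2, 3, 4].
In this case, we can embed both sequences in a single call, using the code below.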

In [19]:
# New indices representing two sequences: [0, 1] and [2, 3, 4]
indices = torch.tensor([0, 1, 2, 3, 4])  # Concatenate indices of all sequences

# Offsets to mark the start of each sequence in `indices`
offsets = torch.tensor([0, 2])  # First sequence starts at index 0, second starts at index 2

output = embedding_bag(indices, offsets)
print(output)


tensor([[ 0.7735,  1.2749,  0.8850],
        [ 0.7659, -0.3560,  0.3488]], grad_fn=<EmbeddingBagBackward0>)
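
To make the role of `offsets` concrete, the sketch below checks that the bagged output matches a manual computation: slice the flat index tensor at each offset, look up the rows, and average them. The layer here is a fresh, randomly initialized one (the seed and variable names are assumptions), so its values will differ from the output printed above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # only so the random initial weights are repeatable

bag = nn.EmbeddingBag(num_embeddings=9, embedding_dim=3, mode="mean")

indices = torch.tensor([0, 1, 2, 3, 4])  # two sequences, concatenated
offsets = torch.tensor([0, 2])           # bag 1 = indices[0:2], bag 2 = indices[2:5]

bagged = bag(indices, offsets)

# Manual equivalent: look up every row, then average each slice separately
rows = bag.weight[indices]
manual = torch.stack([rows[0:2].mean(dim=0), rows[2:5].mean(dim=0)])

print(bagged.shape)
print(torch.allclose(bagged, manual))
```

Each offset is a start position, and a bag runs until the next offset (or to the end of `indices` for the last bag), so no explicit lengths are needed.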


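## Batch function
The batch size defines the number of samples that are propagated through the network in one step. The `collate_batch` function below converts a batch of (context, target) pairs into the three tensors the model needs: the target indices, the concatenated context indices, and the offsets that mark where each context starts.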

In [None]:
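from torch.utils.data import DataLoader

# Assumes vocab, text_pipeline, and device are defined earlier in the notebook
def collate_batch(batch):
    target_list, context_list, offsets = [], [], [0]
    for _context, _target in batch:
        # Index of the target word
        target_list.append(vocab[_target])
        # Indices of the context words for this sample
        processed_context = torch.tensor(text_pipeline(_context), dtype=torch.int64)
        context_list.append(processed_context)
        # Length of this context; used below to compute where the next bag starts
        offsets.append(processed_context.size(0))
    # Convert to tensors only after the loop, once every sample has been collected
    target_list = torch.tensor(target_list, dtype=torch.int64)
    # Cumulative sum of the lengths (dropping the last) gives each bag's start position
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    context_list = torch.cat(context_list)
    return target_list.to(device), context_list.to(device), offsets.to(device)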

BATCH_SIZE = 64 # batch size for training
dataloader_cbow = DataLoader(cobw_data, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch)
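
The cell above relies on names defined elsewhere in the notebook. As a self-contained sketch of the same collate pattern, the example below substitutes hypothetical stand-ins: `toy_vocab`, `toy_data`, a simple `text_pipeline`, and a CPU `device` are all assumptions made for illustration, not the notebook's actual objects.

```python
import torch
from torch.utils.data import DataLoader

# Hypothetical stand-ins for the notebook's vocab, device, and text_pipeline
toy_vocab = {"i": 0, "like": 1, "cats": 2, "hate": 3, "dogs": 4}
device = torch.device("cpu")

def text_pipeline(tokens):
    """Map a list of tokens to their vocabulary indices."""
    return [toy_vocab[t] for t in tokens]

# Hypothetical (context tokens, target token) pairs
toy_data = [
    (["i", "cats"], "like"),
    (["i", "dogs"], "hate"),
    (["like", "cats", "dogs"], "i"),
]

def collate_batch(batch):
    target_list, context_list, lengths = [], [], []
    for context, target in batch:
        target_list.append(toy_vocab[target])
        ids = torch.tensor(text_pipeline(context), dtype=torch.int64)
        context_list.append(ids)
        lengths.append(ids.size(0))
    targets = torch.tensor(target_list, dtype=torch.int64)
    # Start position of each bag: 0, then the running total of the previous lengths
    offsets = torch.tensor([0] + lengths[:-1]).cumsum(dim=0)
    contexts = torch.cat(context_list)
    return targets.to(device), contexts.to(device), offsets.to(device)

loader = DataLoader(toy_data, batch_size=3, shuffle=False, collate_fn=collate_batch)
targets, contexts, offsets = next(iter(loader))

print(targets)   # tensor([1, 3, 0])
print(contexts)  # tensor([0, 2, 0, 4, 1, 2, 4])
print(offsets)   # tensor([0, 2, 4])
```

The `contexts` and `offsets` tensors can be passed directly to an `nn.EmbeddingBag` layer, which returns one aggregated vector per sample in the batch.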