# <Font color = 'pickle'>**PyTorch Embedding Layers**

In this lecture, we take a closer look at embeddings, in particular how to use `torch.nn.Embedding` and `torch.nn.EmbeddingBag`.

# <Font color = 'pickle'>**Introduction**

<font size = 5, color = 'pickle'>**Embedding**

- This layer is a lookup table that stores word embeddings of a fixed dictionary and size.
- The word embeddings can be retrieved using indices, where the index is the index of the word in the vocabulary.
- The input to this layer is a sequence of integer indices, where each index represents a word in the input sentence or document.
- The output of this layer is a sequence of word embeddings, where each embedding represents a word in the input sequence.
- The embeddings are initialized randomly and are learned during training using backpropagation.
- The size of the embeddings is specified when the layer is created, and is typically a hyperparameter that is tuned based on the specific task and dataset.
- This layer is commonly used in natural language processing (NLP) tasks, such as text classification, sentiment analysis, and machine translation.
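
The lookup behaviour described above can be sketched as follows. This is a minimal illustration, and the sizes (8 tokens, 5 dimensions) and the index values are assumptions chosen only for the example. Each output row is the row of the weight matrix selected by the corresponding index.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes (assumed): a vocabulary of 8 tokens, each mapped to a 5-dimensional vector
embedding = nn.Embedding(num_embeddings=8, embedding_dim=5)

# One "sentence" made of three token indices
indices = torch.tensor([1, 2, 3])
vectors = embedding(indices)

print(vectors.shape)  # torch.Size([3, 5])

# The layer is a table lookup: row i of the output is weight[indices[i]]
print(torch.equal(vectors, embedding.weight[indices]))  # True
```

Because the output rows are slices of `embedding.weight`, gradients flow back only to the rows that were actually looked up.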

<font size = 5, color = 'pickle'>**EmbeddingBag**

* This is an extension of the nn.Embedding layer.
* In simple terms, EmbeddingBag is a two-step process:
    - The first step looks up the embeddings, and the second step reduces them (sum/mean/max, according to the "mode" argument) across dimension 1, the token dimension.
    - So we can get the same result that nn.EmbeddingBag gives by calling torch.nn.functional.embedding, followed by torch.sum/mean/max.
* However, EmbeddingBag is much more time and memory efficient than using an Embedding followed by sum/mean/max, because it does not instantiate the intermediate per-token embeddings.
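
The equivalence claimed above can be checked with a small sketch. The sizes and index values are assumptions for illustration. The input is 2D, so every bag has the same length and no offsets are needed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Illustrative sizes (assumed): 8 tokens, 5-dimensional embeddings
bag = nn.EmbeddingBag(num_embeddings=8, embedding_dim=5, mode="mean")

# Two "sentences" of three token indices each
indices = torch.tensor([[1, 2, 4], [5, 2, 6]])

# One step: lookup and reduction happen inside EmbeddingBag
bag_output = bag(indices)

# Two steps: look up with the same weights, then average over the token dimension
two_step_output = F.embedding(indices, bag.weight).mean(dim=1)

print(bag_output.shape)  # torch.Size([2, 5])
print(torch.allclose(bag_output, two_step_output, atol=1e-6))
```

The two-step version first builds a tensor of shape `(2, 3, 5)`, which is the intermediate result that EmbeddingBag avoids.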

# <font color = 'pickle'> **Install/ Update/ Import useful libraries**

In [None]:
if 'google.colab' in str(get_ipython()):
    !pip install torchtext --upgrade



In [None]:
# Import PyTorch library for tensor computation and deep learning
import torch

# Import neural network module from PyTorch for building neural network layers
import torch.nn as nn

# Import pandas for data manipulation and analysis
import pandas as pd

# Import vocab from torchtext for handling vocabulary and text preprocessing
from torchtext.vocab import vocab

# Import Counter from collections for counting elements in collections like lists
from collections import Counter

# Import Dataset and DataLoader from PyTorch for handling data loading and batching
from torch.utils.data import Dataset, DataLoader


# <Font color = 'pickle'>**Load Data**

In [None]:
# Generate some data
data = {
    "label": [0, 1, 1, 0],
    "data": [
        "Movie was bad",
        "Movie was good",
        "It was thrilling.",
        "It was horrible. "
    ]
}

In [None]:
df = pd.DataFrame(data)

In [None]:
df.head()

   label               data
0      0      Movie was bad
1      1     Movie was good
2      1  It was thrilling.
3      0  It was horrible. 


# <Font color = 'pickle'>**Create Custom Torch Dataset**

In [None]:
X = df['data']
y = df['label']

In [None]:
class CustomDataset(Dataset):
    """
    Custom Dataset class inheriting from PyTorch's Dataset class.
    Intended to handle custom text and label data.

    Attributes:
        X (pd.Series): The input features (text).
        y (pd.Series): The labels corresponding to the input features.
    """

    def __init__(self, X, y):
        """
        Initialize the dataset with input features and labels.

        Parameters:
            X (pd.Series): Input features.
            y (pd.Series): Labels corresponding to input features.
        """
        self.X = X  # Input features (text)
        self.y = y  # Corresponding labels

    def __len__(self):
        """
        Return the total number of samples in the dataset.

        Returns:
            int: Number of samples in the dataset.
        """
        return len(self.X)  # Return the length of the dataset

    def __getitem__(self, idx):
        """
        Fetch and return a single sample from the dataset at the given index.

        Parameters:
            idx (int): Index of the sample to fetch.

        Returns:
            tuple: A tuple containing the label and the input feature (text) at the index.
        """
        text = self.X.iloc[idx]  # Fetch the input feature at the given index
        labels = self.y.iloc[idx]  # Fetch the corresponding label
        sample = (labels, text)  # Create a tuple of label and input feature

        return sample  # Return the sample as a tuple


In [None]:
# Create an instance of CustomDataset with the input features X and labels y for training
train_dataset = CustomDataset(X, y)


In [None]:
# Iterate through train_dataset, printing the index, label, and text of each sample
# (the loop variables are named so that they do not overwrite the X and y Series defined earlier)
for i, (label, text) in enumerate(train_dataset):
    print(i, label, text)


0 0 Movie was bad
1 1 Movie was good
2 1 It was thrilling.
3 0 It was horrible. 


In [None]:
# Retrieve the sample at index 2 from train_dataset using the __getitem__ method
train_dataset.__getitem__(2)



(1, 'It was thrilling.')

In [None]:
# Retrieve the sample at index 2 from train_dataset using Python's built-in indexing syntax
train_dataset[2]


(1, 'It was thrilling.')

Key Points:

1. **Retrieval**: The line is used to fetch the sample located at index 2 in `train_dataset`.
2. **Syntactic Sugar**: It utilizes Python's built-in indexing syntax, which internally calls the `__getitem__` method.
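
To make the indexing point concrete, here is a small sketch with a hypothetical toy class, `Squares`, which is not part of the notebook's dataset code. It also shows why a plain `for` loop works on an object that defines only `__len__` and `__getitem__`: Python calls `__getitem__` with 0, 1, 2, ... until an `IndexError` is raised.

```python
class Squares:
    """Hypothetical toy container used only to illustrate __getitem__."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        # Raising IndexError past the end is what lets iteration stop
        if not 0 <= idx < self.n:
            raise IndexError(idx)
        return idx * idx


squares = Squares(4)

# Both forms call the same method
print(squares[3])              # 9
print(squares.__getitem__(3))  # 9

# Iteration falls back to __getitem__ because no __iter__ is defined
print(list(squares))           # [0, 1, 4, 9]
```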



# <Font color = 'pickle'>**Create Vocab**

In [None]:
# Initialize an empty Counter object to hold the word frequencies
counter = Counter()

# Loop through each sample in train_dataset to count word occurrences
for (label, line) in train_dataset:
    # Split the line into words and update their frequencies in the counter
    counter.update(str(line).split())


Key Points:

1. **Counter Initialization**: A Counter object is initialized to hold the frequencies of individual words.
2. **Dataset Looping**: The `for` loop iterates through each sample in `train_dataset`.
3. **Word Counting**: Each line (text sample) is split into words, which are then used to update the Counter object.
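
The counting step can be tried in isolation with the short sketch below. The sentences are taken from the toy data and the rest is plain standard-library Python. Note that `str.split()` splits on whitespace only, so punctuation stays attached to the word.

```python
from collections import Counter

counter = Counter()

# Each call to update() adds the counts of the tokens in one line
for line in ["Movie was bad", "Movie was good", "It was thrilling."]:
    counter.update(line.split())

print(counter["was"])         # 3
print(counter["Movie"])       # 2
print(counter["thrilling."])  # 1, the trailing period is part of the token
print(counter["thrilling"])   # 0, a Counter returns 0 for missing keys
```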


In [None]:
counter

Counter({'Movie': 2,
         'was': 4,
         'bad': 1,
         'good': 1,
         'It': 2,
         'thrilling.': 1,
         'horrible.': 1})

In [None]:
# Create a vocabulary using the word frequencies stored in the counter, with a minimum frequency of 1 for inclusion
my_vocab = vocab(counter, min_freq=1)


Key Points:

1. **Vocabulary Creation**: The line initializes a vocabulary object using the word frequencies gathered so far.
2. **Minimum Frequency**: Words are included in the vocabulary only if their frequency is at least 1, as specified by the `min_freq` parameter.
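
The effect of `min_freq` can be sketched in plain Python. The helper `build_stoi` below is hypothetical and is not how torchtext is implemented. It assumes indices are assigned in the order in which tokens were first counted, which matches the mapping printed later in this notebook.

```python
from collections import Counter

counter = Counter("Movie was bad Movie was good".split())


def build_stoi(counter, min_freq=1):
    """Hypothetical helper: keep tokens seen at least min_freq times, index them in insertion order."""
    kept = [token for token, freq in counter.items() if freq >= min_freq]
    return {token: index for index, token in enumerate(kept)}


print(build_stoi(counter, min_freq=1))  # {'Movie': 0, 'was': 1, 'bad': 2, 'good': 3}
print(build_stoi(counter, min_freq=2))  # {'Movie': 0, 'was': 1}
```

With `min_freq=1` every counted word is kept, so on a dataset this small the threshold has no filtering effect.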

In [None]:
# Output or examine the contents of the my_vocab object to understand the constructed vocabulary
my_vocab


Vocab()

Key Points:

1. **Output/Examination**: Evaluating `my_vocab` on its own displays only `Vocab()`, so the object's repr does not reveal its contents.
2. **Vocabulary Object**: `my_vocab` holds the vocabulary constructed from the word frequencies in the dataset.



In [None]:
# Retrieve the word-to-index mapping from the my_vocab object
my_vocab.get_stoi()


{'thrilling.': 5,
 'bad': 2,
 'was': 1,
 'horrible.': 6,
 'It': 4,
 'good': 3,
 'Movie': 0}

In [None]:
# Insert the '<unk>' token at index 0 in my_vocab to represent any unknown words
my_vocab.insert_token('<unk>', 0)


Key Points:

1. **Token Insertion**: The line adds a special token `<unk>` to the vocabulary.
2. **Handling Unknown Words**: The purpose of this token is to represent any unknown words encountered during the model's operation.
3. **Index Position**: The token is inserted at index 0, as specified by the second argument.
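
The following plain-Python sketch illustrates the idea behind these points. The helpers `insert_token` and `lookup` are hypothetical stand-ins, not torchtext functions. Inserting a token at index 0 shifts every existing index up by one, and a fallback index lets unknown words map to `<unk>`.

```python
stoi = {"Movie": 0, "was": 1, "bad": 2}


def insert_token(stoi, token, index):
    """Hypothetical helper: insert token at index, shifting tokens at or after it up by one."""
    shifted = {t: (i + 1 if i >= index else i) for t, i in stoi.items()}
    shifted[token] = index
    return shifted


stoi = insert_token(stoi, "<unk>", 0)
print(stoi)  # {'Movie': 1, 'was': 2, 'bad': 3, '<unk>': 0}

default_index = stoi["<unk>"]


def lookup(token):
    """Hypothetical helper: return the token's index, or the <unk> index if it is missing."""
    return stoi.get(token, default_index)


print([lookup(t) for t in "Movie was great".split()])  # [1, 2, 0]
```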

In [None]:
# check mapping of words to index
my_vocab.get_stoi()

{'thrilling.': 6,
 'bad': 3,
 'was': 2,
 'horrible.': 7,
 'It': 5,
 'good': 4,
 'Movie': 1,
 '<unk>': 0}

In [None]:
# Print vocab indices for some random text
[my_vocab[token] for token in 'Movie was bad'.split()]

[1, 2, 3]

In [None]:
# check whether word hello is in dictionary
'hello' in my_vocab

False

In [None]:
# get the index for the word hello
# since this word is not in the vocabulary and no default index is set yet, we should get an error
try:
    my_vocab['hello']
except RuntimeError:
    print('token not found in vocab')

token not found in vocab


In [None]:
# set the default index to zero
# thus any unknown word will be represented by index 0, i.e. the token '<unk>'
my_vocab.set_default_index(0)

In [None]:
# check again whether the word hello is in the vocabulary (setting a default index does not change membership)
print('hello' in my_vocab)


False


In [None]:
# get the index for the word hello
# since we set default index to 0, now it should return 0 for the word hello
my_vocab['hello']

0

# <Font color = 'pickle'>**Create DataLoader for Embedding**

In [None]:
def text_pipeline(x, vocab):
    """
    Converts a text string into a list of vocabulary indices.

    Parameters:
        x (str): The input text string to be converted.
        vocab (vocab object): The vocabulary object containing the word-to-index mapping.

    Returns:
        list: A list of integers representing the vocabulary indices of the words in the input string.
    """
    # Tokenize the input string, then map each token to its corresponding index in the given vocabulary
    return [vocab[token] for token in str(x).split()]


In [None]:
# check the function
text_pipeline('Movie was bad', my_vocab)

[1, 2, 3]

In [None]:
def collate_batch(batch):
    """
    Collates a batch of samples into tensors of labels and texts.

    Parameters:
        batch (list): A list of tuples, each containing a label and a text.

    Returns:
        tuple: A tuple containing two tensors, one for labels and one for texts.
    """
    # Unpack the batch into separate lists for labels and texts
    labels, texts = zip(*batch)

    # Convert the list of labels into a tensor of dtype int32
    labels = torch.tensor(labels, dtype=torch.int32)

    # Convert the list of texts into a tensor; each text is transformed into a list of vocabulary indices using text_pipeline
    texts = torch.tensor([text_pipeline(text, my_vocab) for text in texts], dtype=torch.int32)

    return labels, texts


Code Explanation:

- The function `collate_batch` accepts a list of tuples, each containing a `label` and a `text`.
- The statement `zip(*batch)` separates the list of tuples into two distinct lists: one for `labels` and another for `texts`.
- The list of `labels` is converted to a PyTorch tensor using `torch.tensor`, with the data type (`dtype`) explicitly set to `torch.int32`.
- The `texts` undergo a transformation via the `text_pipeline` function within a list comprehension. This results in a list of lists, where each inner list is a sequence of integer indices representing words. This list is then converted into a PyTorch tensor, also with the dtype set to `torch.int32`, which `nn.Embedding` accepts as index input alongside `torch.int64`. The conversion works here because every sentence has exactly three tokens; `torch.tensor` cannot build a tensor from inner lists of unequal length.
- Finally, the function returns a tuple consisting of the `labels` and `texts` tensors, ready for further processing.
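
The steps above can be reproduced in a self-contained sketch that does not depend on torchtext. The dictionary `toy_stoi` and the function `toy_collate` are hypothetical stand-ins for `my_vocab` and `collate_batch`, and the embedding size is an assumption.

```python
import torch
import torch.nn as nn

# Hypothetical toy vocabulary; index 0 is reserved for unknown words
toy_stoi = {"<unk>": 0, "Movie": 1, "was": 2, "bad": 3, "good": 4}


def toy_collate(batch):
    """Hypothetical stand-in for collate_batch, using a plain dict as the vocabulary."""
    # Separate the (label, text) tuples into two tuples
    labels, texts = zip(*batch)
    labels = torch.tensor(labels, dtype=torch.int32)
    # Map every whitespace-separated token to its index (0 if unknown)
    indices = [[toy_stoi.get(tok, 0) for tok in text.split()] for text in texts]
    # All inner lists must have the same length for this conversion to succeed
    texts = torch.tensor(indices, dtype=torch.int32)
    return labels, texts


batch = [(0, "Movie was bad"), (1, "Movie was good")]
labels, texts = toy_collate(batch)
print(labels)  # tensor([0, 1], dtype=torch.int32)
print(texts)

# int32 indices can be passed directly to an embedding layer
embedding = nn.Embedding(num_embeddings=len(toy_stoi), embedding_dim=5)
print(embedding(texts).shape)  # torch.Size([2, 3, 5])
```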



**----Digression Understanding zip, zip(*)-----**

In [None]:
x = [1, 2, 3]
y = [11, 12, 13]
z = zip(x, y)
print(x, y, z)

[1, 2, 3] [11, 12, 13] <zip object at 0x7dc7b3d12680>


In [None]:
temp = list(z)

In [None]:
temp

[(1, 11), (2, 12), (3, 13)]

In [None]:
temp[0]

(1, 11)

In [None]:
x1, y1 = zip(*temp)

In [None]:
print(x1, y1)

(1, 2, 3) (11, 12, 13)


**----END of Digression-----**

In [None]:
# check the function by passing complete dataset
collate_batch(train_dataset)

(tensor([0, 1, 1, 0], dtype=torch.int32),
 tensor([[1, 2, 3],
         [1, 2, 4],
         [5, 2, 6],
         [5, 2, 7]], dtype=torch.int32))

As we can see, we get the labels along with the word indices for every sentence.

In [None]:
# create DataLoader now
torch.manual_seed(0)
batch_size = 2
train_loader = DataLoader(dataset=train_dataset,
                          batch_size=batch_size,
                          shuffle=True,
                          collate_fn=collate_batch,
                          )

In [None]:
# iterate over the dataloader
torch.manual_seed(0)
for label, text in train_loader:
    print(label, text)

tensor([1, 1], dtype=torch.int32) tensor([[5, 2, 6],
        [1, 2, 4]], dtype=torch.int32)
tensor([0, 0], dtype=torch.int32) tensor([[5, 2, 7],
        [1, 2, 3]], dtype=torch.int32)


# <Font color = 'pickle'>**Embedding Layer**

In [None]:
# Instantiate the embedding layer with the total number of embeddings (the vocabulary size) and the embedding dimension, i.e. the length of each vector
torch.manual_seed(0)
model = nn.Embedding(num_embeddings=len(my_vocab), embedding_dim=5)

In [None]:
# check the weights associated with the embedding layer
model.weight

Parameter containing:
tensor([[-1.1258, -1.1524, -0.2506, -0.4339,  0.8487],
        [ 0.6920, -0.3160, -2.1152,  0.3223, -1.2633],
        [ 0.3500,  0.3081,  0.1198,  1.2377,  1.1168],
        [-0.2473, -1.3527, -1.6959,  0.5667,  0.7935],
        [ 0.5988, -1.5551, -0.3414,  1.8530, -0.2159],
        [-0.7425,  0.5627,  0.2596, -0.1740, -0.6787],
        [ 0.9383,  0.4889,  1.2032,  0.0845, -1.2001],
        [-0.0048, -0.5181, -0.3067, -1.5810,  1.7066]], requires_grad=True)

In [None]:
# iterate over the dataloader and check the output of the model
for y, x in train_loader:
    output = model(x)
    print('\nx\n', x)
    print('\ny\n', y)
    print('\nOutput Shape \n', output.shape)
    print('\nOutput\n', output)
    sentence_embedding = torch.mean(output, dim=1)
    print('-'*75)
    print('sentence_embedding')
    print(sentence_embedding)
    print('='*75)


x
 tensor([[1, 2, 4],
        [5, 2, 6]], dtype=torch.int32)

y
 tensor([1, 1], dtype=torch.int32)

Output Shape 
 torch.Size([2, 3, 5])

Output
 tensor([[[ 0.6920, -0.3160, -2.1152,  0.3223, -1.2633],
         [ 0.3500,  0.3081,  0.1198,  1.2377,  1.1168],
         [ 0.5988, -1.5551, -0.3414,  1.8530, -0.2159]],

        [[-0.7425,  0.5627,  0.2596, -0.1740, -0.6787],
         [ 0.3500,  0.3081,  0.1198,  1.2377,  1.1168],
         [ 0.9383,  0.4889,  1.2032,  0.0845, -1.2001]]],
       grad_fn=<EmbeddingBackward0>)
---------------------------------------------------------------------------
sentence_embedding
tensor([[ 0.5469, -0.5210, -0.7789,  1.1376, -0.1208],
        [ 0.1819,  0.4532,  0.5276,  0.3827, -0.2540]],
       grad_fn=<MeanBackward1>)

x
 tensor([[5, 2, 7],
        [1, 2, 3]], dtype=torch.int32)

y
 tensor([0, 0], dtype=torch.int32)

Output Shape 
 torch.Size([2, 3, 5])

Output
 tensor([[[-0.7425,  0.5627,  0.2596, -0.1740, -0.6787],
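         [ 0.3500,  0.3081,  0.1198,  1.2377,  1.1168],
         [-0.0048, -0.5181, -0.3067, -1.5810,  1.7066]],

        [[ 0.6920, -0.3160, -2.1152,  0.3223, -1.2633],
         [ 0.3500,  0.3081,  0.1198,  1.2377,  1.1168],
         [-0.2473, -1.3527, -1.6959,  0.5667,  0.7935]]],
       grad_fn=<EmbeddingBackward0>)
---------------------------------------------------------------------------
sentence_embedding
tensor([[-0.1325,  0.1176,  0.0243, -0.1724,  0.7149],
        [ 0.2649, -0.4535, -1.2304,  0.7089,  0.2157]],
       grad_fn=<MeanBackward1>)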

In [None]:
# check the model output for an arbitrary sequence of indices (a sentence)
output = model(torch.tensor([5, 3, 4, 5]))
output

tensor([[-0.7425,  0.5627,  0.2596, -0.1740, -0.6787],
        [-0.2473, -1.3527, -1.6959,  0.5667,  0.7935],
        [ 0.5988, -1.5551, -0.3414,  1.8530, -0.2159],
        [-0.7425,  0.5627,  0.2596, -0.1740, -0.6787]],
       grad_fn=<EmbeddingBackward0>)

In [None]:
output.shape

torch.Size([4, 5])

In [None]:
torch.mean(output, dim=0)

tensor([-0.2834, -0.4456, -0.3795,  0.5179, -0.1950], grad_fn=<MeanBackward1>)

# <Font color = 'pickle'>**Create DataLoader for EmbeddingBag**

In [None]:
def collate_batch(batch):
    """
    Collates a batch of samples into tensors of labels, texts, and offsets.

    Parameters:
        batch (list): A list of tuples, each containing a label and a text.

    Returns:
        tuple: A tuple containing three tensors:
               - Labels tensor
               - Concatenated texts tensor
               - Offsets tensor indicating the start positions of each text in the concatenated tensor
    """
    # Unpack the batch into separate lists for labels and texts
    labels, texts = zip(*batch)

    # Convert the list of labels into a tensor of dtype int32
    labels = torch.tensor(labels, dtype=torch.int32)

    # Convert the list of texts into a list of lists; each inner list contains the vocabulary indices for a text
    list_of_list_of_indices = [text_pipeline(text, my_vocab) for text in texts]

    # Compute the offsets for each text in the concatenated tensor
    offsets = [0] + [len(i) for i in list_of_list_of_indices]
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)

    # Concatenate all text indices into a single tensor
    texts = torch.cat([torch.tensor(i, dtype=torch.int64) for i in list_of_list_of_indices])

    return labels, texts, offsets


# Example: [[1, 2, 3], [4, 5, 6]] -> texts = [1, 2, 3, 4, 5, 6], offsets = [0, 3]


Code Explanation:

- `text_pipeline` is a utility function that transforms a text string into a list of vocabulary indices. It takes a text string and a vocabulary object, then uses the vocabulary to map each word in the text to its corresponding index.
  
- `collate_batch` is designed to transform a batch of labeled text data into a format that can be directly fed into a neural network for training or inference.
  
- The function accepts an input batch, which is a list of tuples. Each tuple consists of a label (`label`) and a text string (`text`).

- Inside `collate_batch`, `zip(*batch)` is utilized to separate the batch into two distinct lists: one for `labels` and another for `texts`.

- The list of `labels` is promptly converted into a PyTorch tensor with data type set to `torch.int32`.

- For each text string in `texts`, `text_pipeline` is invoked to transform it into a list of vocabulary indices. These lists are then stored in another list named `list_of_list_of_indices`.

- The individual lists within `list_of_list_of_indices` are concatenated into a single PyTorch tensor using `torch.cat`. This tensor holds the entire batch of text data in index form.

- To keep track of where each text begins within the concatenated tensor, an `offsets` tensor is computed. It starts at zero, and each subsequent entry is the cumulative sum of the lengths of the preceding texts. The length of the last text is dropped because no text starts after it.

- The tensors for `labels`, `texts`, and `offsets` are packaged into a tuple and returned as the final output of `collate_batch`.
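
The offsets computation matters most when texts have different lengths, which the toy data does not show. The sketch below uses made-up index lists of lengths 3, 2, and 4 (an assumption for illustration) and the same arithmetic as `collate_batch`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Made-up index lists of unequal length (lengths 3, 2, and 4)
list_of_list_of_indices = [[1, 2, 3], [5, 2], [4, 2, 6, 7]]

# Start at 0, then accumulate the lengths of all texts except the last
lengths = [len(indices) for indices in list_of_list_of_indices]
offsets = torch.tensor([0] + lengths[:-1]).cumsum(dim=0)

# Flatten all indices into one 1D tensor
texts = torch.cat([torch.tensor(i, dtype=torch.int64) for i in list_of_list_of_indices])

print(texts)    # tensor([1, 2, 3, 5, 2, 4, 2, 6, 7])
print(offsets)  # tensor([0, 3, 5])

# EmbeddingBag uses the offsets to split the flat tensor back into bags
bag = nn.EmbeddingBag(num_embeddings=8, embedding_dim=5)  # default mode is "mean"
output = bag(texts, offsets)
print(output.shape)  # torch.Size([3, 5])

# The second bag is the mean of the embeddings for indices 5 and 2
expected = bag.weight[[5, 2]].mean(dim=0)
print(torch.allclose(output[1], expected, atol=1e-6))
```

No padding is required in this format, which is one reason the flat tensor plus offsets representation is convenient for variable-length text.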



In [None]:
# create data loader now
torch.manual_seed(0)
batch_size = 2
train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=batch_size,
                                           shuffle=True,
                                           collate_fn=collate_batch,
                                           )

In [None]:
# iterate over the data loader to see the output
torch.manual_seed(0)
for label, text, offsets in train_loader:
    print(label, text, offsets)


tensor([1, 1], dtype=torch.int32) tensor([5, 2, 6, 1, 2, 4]) tensor([0, 3])
tensor([0, 0], dtype=torch.int32) tensor([5, 2, 7, 1, 2, 3]) tensor([0, 3])


# <Font color = 'pickle'>**Embedding Bag Layer**

In [None]:
# Instantiate the EmbeddingBag layer with the total number of embeddings and the embedding dimension,
# i.e. the dimension of each vector; the default mode is 'mean', so each bag of embeddings is averaged

torch.manual_seed(0)
model = nn.EmbeddingBag(len(my_vocab), 5)

In [None]:
model.weight

Parameter containing:
tensor([[-1.1258, -1.1524, -0.2506, -0.4339,  0.8487],
        [ 0.6920, -0.3160, -2.1152,  0.3223, -1.2633],
        [ 0.3500,  0.3081,  0.1198,  1.2377,  1.1168],
        [-0.2473, -1.3527, -1.6959,  0.5667,  0.7935],
        [ 0.5988, -1.5551, -0.3414,  1.8530, -0.2159],
        [-0.7425,  0.5627,  0.2596, -0.1740, -0.6787],
        [ 0.9383,  0.4889,  1.2032,  0.0845, -1.2001],
        [-0.0048, -0.5181, -0.3067, -1.5810,  1.7066]], requires_grad=True)

In [None]:
for label, text, offsets in train_loader:
    output = model(text, offsets)
    print('Output')
    print(output)
    print(output.shape)
    print('='*75)

Output
tensor([[ 0.5469, -0.5210, -0.7789,  1.1376, -0.1208],
        [ 0.1819,  0.4532,  0.5276,  0.3827, -0.2540]],
       grad_fn=<EmbeddingBagBackward0>)
torch.Size([2, 5])
Output
tensor([[-0.1325,  0.1176,  0.0243, -0.1724,  0.7149],
        [ 0.2649, -0.4535, -1.2304,  0.7089,  0.2157]],
       grad_fn=<EmbeddingBagBackward0>)
torch.Size([2, 5])


---