# <Font color = 'pickle'>**PyTorch Embedding Layers**

In this lecture we will learn how to use PyTorch embedding layers, specifically torch.nn.Embedding and torch.nn.EmbeddingBag.

# <Font color = 'pickle'>**Introduction**

<font size = 5, color = 'pickle'>**Embedding**

* This layer is a lookup table that stores embeddings for a fixed-size vocabulary, one vector of fixed dimension per word.
* The word embeddings are retrieved by index, where the index is the position of the word in the vocab.
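
The lookup-table behaviour can be checked directly. The following is a minimal sketch, assuming an illustrative vocabulary of 8 words and 5-dimensional vectors; it shows that calling the layer on a tensor of indices returns the corresponding rows of its weight matrix.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# 8 words in the vocabulary, each mapped to a 5-dimensional vector
# (sizes chosen for illustration only)
embedding = nn.Embedding(num_embeddings=8, embedding_dim=5)

indices = torch.tensor([1, 2, 3])
looked_up = embedding(indices)

# the layer returns the rows of its weight matrix at the given indices
same_as_indexing = torch.equal(looked_up, embedding.weight[indices])
print(looked_up.shape)      # torch.Size([3, 5])
print(same_as_indexing)     # True
```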

<font size = 5, color = 'pickle'>**EmbeddingBag**

* This is an extension of the nn.Embedding layer.
* In simple terms, EmbeddingBag is a two-step process:
    - The first step looks up the embeddings, and the second step reduces them (sum/mean/max, according to the "mode" argument) across dimension 1, i.e. across the words of each bag.
    - So we can get the same result that EmbeddingBag gives by calling torch.nn.functional.embedding, followed by torch.sum/mean/max.
* However, EmbeddingBag is much more time and memory efficient than using an Embedding followed by sum/mean/max, because it does not materialize the intermediate embeddings.
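
The two-step equivalence described above can be sketched as follows. This is a small illustration, assuming mode="mean", made-up layer sizes and a 2D input in which each row is one bag; it compares EmbeddingBag with an explicit lookup followed by a mean over dimension 1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

bag = nn.EmbeddingBag(num_embeddings=8, embedding_dim=5, mode="mean")

# two bags (sentences) of three word indices each
batch = torch.tensor([[1, 2, 4],
                      [5, 2, 6]])

# one step: look up and average inside EmbeddingBag
bag_output = bag(batch)

# two steps: plain lookup with the same weights, then mean over dimension 1
two_step_output = F.embedding(batch, bag.weight).mean(dim=1)

outputs_match = torch.allclose(bag_output, two_step_output, atol=1e-6)
print(bag_output.shape)   # torch.Size([2, 5])
print(outputs_match)      # True
```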

#<font color = 'pickle'> **Install/ Update/ Import useful libraries**

In [1]:
if 'google.colab' in str(get_ipython()):
  !pip install torchtext --upgrade

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
import torch
import torch.nn as nn

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime
from torchtext.vocab import vocab
from collections import Counter, OrderedDict

# <Font color = 'pickle'>**Load Data**

In [3]:
# Generate some data 
data = {
    "label": [0,1,1,0],
    "data": [
        "Movie was bad",
        "Movie was good",
        "It was thrilling",
        "It was horrible"
    ]
}

In [4]:
df = pd.DataFrame(data)

In [5]:
df.head()

   label              data
0      0     Movie was bad
1      1    Movie was good
2      1  It was thrilling
3      0   It was horrible


#<Font color = 'pickle'>**Create Custom Torch Dataset**

In [6]:
X = df['data']
y = df['label']

In [44]:
from torch.utils.data import Dataset, DataLoader

class CustomDataset(Dataset):
 
    def __init__(self, X,y):
        self.X = X
        self.y = y

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()

        text = self.X.iloc[idx]       
        labels = self.y.iloc[idx]
        sample = (labels,text)
        
        return sample

In [8]:
train_dataset = CustomDataset(X,y)

In [9]:
# use names that do not shadow the X and y Series defined above
for i, (label, text) in enumerate(train_dataset):
    print(i, label, text)

0 0 Movie was bad
1 1 Movie was good
2 1 It was thrilling
3 0 It was horrible


In [11]:
train_dataset.__getitem__([2])

(2    1
 Name: label, dtype: int64, 2    It was thrilling
 Name: data, dtype: object)

# <Font color = 'pickle'>**Create Vocab**

In [12]:
from collections import Counter

counter = Counter()
for (label, line) in train_dataset:
   counter.update(str(line).split())

In [13]:
counter

Counter({'Movie': 2,
         'was': 4,
         'bad': 1,
         'good': 1,
         'It': 2,
         'thrilling': 1,
         'horrible': 1})

In [14]:
from torchtext.vocab import  vocab

In [15]:
# creating vocab using the vocab function from torchtext
my_vocab = vocab(counter, min_freq=1)

In [16]:
my_vocab

Vocab()

In [17]:
# get mapping of words to index
my_vocab.get_stoi()

{'horrible': 6,
 'bad': 2,
 'was': 1,
 'Movie': 0,
 'good': 3,
 'thrilling': 5,
 'It': 4}

In [18]:
# insert '<unk>' token to represent any unknown word
my_vocab.insert_token('<unk>', 0)

In [19]:
# check mapping of words to index
my_vocab.get_stoi()

{'horrible': 7,
 'thrilling': 6,
 'good': 4,
 '<unk>': 0,
 'Movie': 1,
 'was': 2,
 'bad': 3,
 'It': 5}

In [20]:
# Print vocab indices for some random text
[my_vocab[token] for token in 'Movie was bad'.split()]

[1, 2, 3]

In [21]:
# check whether word hello is in dictionary
'hello' in my_vocab

False

In [22]:
# get the index for the word hello
# since this word is not in the vocab and no default index is set, we get an error
my_vocab['hello']

RuntimeError: ignored

In [23]:
# set the default index to zero
# thus any unknown word will be represented by index 0, i.e. the token '<unk>'
my_vocab.set_default_index(0)

In [24]:
# again check if the word hello is in the dict
print('hello' in my_vocab)

False


In [25]:
# get the index for  the word hello
# since we set default index to 0, now it should return 0 for the word hello
my_vocab['hello']

0
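
To make the lookup behaviour concrete, here is a plain-Python sketch of the same mapping. It is a stand-in built on a dict, not torchtext's implementation; the names `itos`, `stoi` and `lookup` are chosen for this illustration only.

```python
from collections import Counter

sentences = ["Movie was bad", "Movie was good",
             "It was thrilling", "It was horrible"]

counter = Counter()
for sentence in sentences:
    counter.update(sentence.split())

# '<unk>' takes index 0; the remaining tokens follow in first-seen order
itos = ["<unk>"] + list(counter)
stoi = {token: index for index, token in enumerate(itos)}

def lookup(token, default_index=0):
    """Return the index of token, or default_index for unseen tokens."""
    return stoi.get(token, default_index)

known = [lookup(token) for token in "Movie was bad".split()]
unknown = lookup("hello")
print(known)     # [1, 2, 3]
print(unknown)   # 0
```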

# <Font color = 'pickle'>**Create DataLoader for Embedding**

In [26]:
# Create lambda functions that map raw text to vocab indices and labels to ints
text_pipeline = lambda x: [my_vocab[token] for token in str(x).split()]
label_pipeline = lambda x: int(x)

In [27]:
# check the function
text_pipeline('Movie was bad')

[1, 2, 3]

In [28]:
'''
The input to the embedding layer is the indices of words from the vocab.
collate_batch() accepts a batch of (label, text) samples, converts each text to vocab indices and returns the labels and indices as tensors.
We will pass collate_batch() as the collate_fn argument of DataLoader,
so each batch contains the indices of the words and the corresponding labels.
'''
def collate_batch(batch):
    label_list, text_list = [], []

    for (_label, _text) in batch:
        label_list.append(label_pipeline(_label))
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        text_list.append(processed_text)

    label_list = torch.tensor(label_list, dtype=torch.int64)
    # torch.stack requires every sentence to have the same number of tokens (3 here)
    text_list = torch.stack(text_list)
    return label_list, text_list

In [29]:
# check the function by passing complete dataset
collate_batch(train_dataset)

(tensor([0, 1, 1, 0]), tensor([[1, 2, 3],
         [1, 2, 4],
         [5, 2, 6],
         [5, 2, 7]]))

As we can see, we got the labels along with the indices of the words.
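
Note that torch.stack only works here because every sentence has exactly three tokens. One possible way to batch sentences of different lengths is to pad them first. The sketch below uses torch.nn.utils.rnn.pad_sequence with a hypothetical `PAD_INDEX` of 0; in this notebook index 0 is already used for '<unk>', so a real vocabulary would probably want a dedicated padding token.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_INDEX = 0  # hypothetical padding index, chosen only for this sketch

# index sequences of different lengths (illustrative values)
sequences = [torch.tensor([1, 2, 3]),
             torch.tensor([5, 2]),
             torch.tensor([4])]

# torch.stack(sequences) would raise an error here because the lengths differ;
# pad_sequence right-pads every sequence to the length of the longest one
padded = pad_sequence(sequences, batch_first=True, padding_value=PAD_INDEX)
print(padded)
# tensor([[1, 2, 3],
#         [5, 2, 0],
#         [4, 0, 0]])
```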

In [30]:
# create DataLoader now
torch.manual_seed(0)
batch_size=2
train_loader= torch.utils.data.DataLoader(dataset=train_dataset,
                                        batch_size=batch_size,
                                        shuffle=True,
                                        collate_fn=collate_batch,
                                       )

In [31]:
# iterate over the dataloader
torch.manual_seed(0)
for label, text in train_loader:
  print(label, text)

tensor([1, 1]) tensor([[5, 2, 6],
        [1, 2, 4]])
tensor([0, 0]) tensor([[5, 2, 7],
        [1, 2, 3]])


# <Font color = 'pickle'>**Embedding Layer**

In [32]:
# Instantiate the embedding layer with the total number of embeddings and the embedding dimension, i.e. the dimension of each vector
torch.manual_seed(0)
model = nn.Embedding(num_embeddings=len(my_vocab),embedding_dim=5)

In [33]:
# check the weights associated with the embedding layer
model.weight

Parameter containing:
tensor([[-1.1258, -1.1524, -0.2506, -0.4339,  0.8487],
        [ 0.6920, -0.3160, -2.1152,  0.3223, -1.2633],
        [ 0.3500,  0.3081,  0.1198,  1.2377,  1.1168],
        [-0.2473, -1.3527, -1.6959,  0.5667,  0.7935],
        [ 0.5988, -1.5551, -0.3414,  1.8530, -0.2159],
        [-0.7425,  0.5627,  0.2596, -0.1740, -0.6787],
        [ 0.9383,  0.4889,  1.2032,  0.0845, -1.2001],
        [-0.0048, -0.5181, -0.3067, -1.5810,  1.7066]], requires_grad=True)

In [34]:
# iterate over the dataloader and check the output of the model
for y, x in train_loader:
    output = model(x)
    print('\nx\n', x)
    print('\ny\n', y)
    print('\nOutput\n', output)
    sentence_embedding = torch.mean(output, dim=1)
    print('-'*75)
    print('sentence_embedding')
    print(sentence_embedding)
    print('='*75)


x
 tensor([[1, 2, 4],
        [5, 2, 6]])

y
 tensor([1, 1])

Output
 tensor([[[ 0.6920, -0.3160, -2.1152,  0.3223, -1.2633],
         [ 0.3500,  0.3081,  0.1198,  1.2377,  1.1168],
         [ 0.5988, -1.5551, -0.3414,  1.8530, -0.2159]],

        [[-0.7425,  0.5627,  0.2596, -0.1740, -0.6787],
         [ 0.3500,  0.3081,  0.1198,  1.2377,  1.1168],
         [ 0.9383,  0.4889,  1.2032,  0.0845, -1.2001]]],
       grad_fn=<EmbeddingBackward0>)
---------------------------------------------------------------------------
sentence_embedding
tensor([[ 0.5469, -0.5210, -0.7789,  1.1376, -0.1208],
        [ 0.1819,  0.4532,  0.5276,  0.3827, -0.2540]],
       grad_fn=<MeanBackward1>)

x
 tensor([[5, 2, 7],
        [1, 2, 3]])

y
 tensor([0, 0])

Output
 tensor([[[-0.7425,  0.5627,  0.2596, -0.1740, -0.6787],
         [ 0.3500,  0.3081,  0.1198,  1.2377,  1.1168],
         [-0.0048, -0.5181, -0.3067, -1.5810,  1.7066]],

        [[ 0.6920, -0.3160, -2.1152,  0.3223, -1.2633],
         [ 0.3500

In [35]:
# check the model output for some arbitrary indices (a sentence)
output  = model(torch.tensor([5, 3, 4, 5]))
output

tensor([[-0.7425,  0.5627,  0.2596, -0.1740, -0.6787],
        [-0.2473, -1.3527, -1.6959,  0.5667,  0.7935],
        [ 0.5988, -1.5551, -0.3414,  1.8530, -0.2159],
        [-0.7425,  0.5627,  0.2596, -0.1740, -0.6787]],
       grad_fn=<EmbeddingBackward0>)

In [36]:
torch.mean(output,dim=0)

tensor([-0.2834, -0.4456, -0.3795,  0.5179, -0.1950], grad_fn=<MeanBackward1>)
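
The averaged vector above can be reproduced without relying on the random seed. The following sketch copies the printed weight matrix (rounded to four decimals, so results agree only approximately) into a layer via nn.Embedding.from_pretrained and averages the rows for the indices [5, 3, 4, 5].

```python
import torch
import torch.nn as nn

# weight matrix copied from the printed model.weight (rounded to 4 decimals)
weights = torch.tensor([
    [-1.1258, -1.1524, -0.2506, -0.4339,  0.8487],
    [ 0.6920, -0.3160, -2.1152,  0.3223, -1.2633],
    [ 0.3500,  0.3081,  0.1198,  1.2377,  1.1168],
    [-0.2473, -1.3527, -1.6959,  0.5667,  0.7935],
    [ 0.5988, -1.5551, -0.3414,  1.8530, -0.2159],
    [-0.7425,  0.5627,  0.2596, -0.1740, -0.6787],
    [ 0.9383,  0.4889,  1.2032,  0.0845, -1.2001],
    [-0.0048, -0.5181, -0.3067, -1.5810,  1.7066],
])
embedding = nn.Embedding.from_pretrained(weights)

sentence = torch.tensor([5, 3, 4, 5])
word_vectors = embedding(sentence)            # one row per word, shape (4, 5)
sentence_vector = word_vectors.mean(dim=0)    # average over the words, shape (5,)
print(sentence_vector)
```

The printed values should agree with the output of torch.mean(output, dim=0) above up to rounding.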

# <Font color = 'pickle'>**Create DataLoader for EmbeddingBag**

In [37]:
'''
We know that the input to the embedding layer is the indices of words from the vocab.
collate_batch() accepts a batch of (label, text) samples and converts each text to vocab indices.
We will pass this collate_batch() as the collate_fn argument of DataLoader,
so each batch contains the indices of the words and the corresponding labels.
For EmbeddingBag we need one extra input: offsets.
All sentences in the batch are concatenated into one 1D tensor, and
offsets holds the starting index position of each bag (sequence) in that tensor.
'''
def collate_batch(batch):
    label_list, text_list, offsets = [], [], [0]
    for (_label, _text) in batch:
        label_list.append(label_pipeline(_label))
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        text_list.append(processed_text)
        offsets.append(processed_text.size(0))
    label_list = torch.tensor(label_list, dtype=torch.int64)
    # drop the last length; the cumulative sum of the rest gives each bag's start
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    text_list = torch.cat(text_list)
    return label_list, text_list, offsets

# Example: the sentences [1, 2, 3] and [4, 5, 6] are concatenated into
# [1, 2, 3, 4, 5, 6] with offsets [0, 3]


In [38]:
# create data loader now
torch.manual_seed(0)
batch_size=2
train_loader= torch.utils.data.DataLoader(dataset=train_dataset,
                                        batch_size=batch_size,
                                        shuffle=True,
                                        collate_fn=collate_batch,
                                        )

In [39]:
# iterate over the data loader to see the output
torch.manual_seed(0)
for label, text, offsets in train_loader:
  print(label, text, offsets)

tensor([1, 1]) tensor([5, 2, 6, 1, 2, 4]) tensor([0, 3])
tensor([0, 0]) tensor([5, 2, 7, 1, 2, 3]) tensor([0, 3])
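
In this toy dataset every bag has three tokens, so the offsets are always [0, 3]. The sketch below, using made-up index sequences of different lengths, shows how the same cumulative-sum computation marks where each bag starts in the flattened tensor.

```python
import torch

# three bags of different lengths (illustrative indices)
bags = [torch.tensor([1, 2, 3]),
        torch.tensor([5, 2]),
        torch.tensor([4, 2, 6, 7])]

lengths = [bag.size(0) for bag in bags]                  # [3, 2, 4]
# each bag starts where the previous bags end
offsets = torch.tensor([0] + lengths[:-1]).cumsum(dim=0)
flat = torch.cat(bags)

print(flat)      # tensor([1, 2, 3, 5, 2, 4, 2, 6, 7])
print(offsets)   # tensor([0, 3, 5])

# recover each bag by slicing between consecutive offsets
starts = offsets.tolist()
ends = starts[1:] + [flat.size(0)]
recovered = [flat[start:end].tolist() for start, end in zip(starts, ends)]
print(recovered)  # [[1, 2, 3], [5, 2], [4, 2, 6, 7]]
```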


# <Font color = 'pickle'>**Embedding Bag Layer**

In [41]:
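# Instantiate the EmbeddingBag layer with the total number of embeddings and the embedding dimension, i.e. the dimension of each vector
# the default mode is 'mean', so each bag is reduced to the average of its word embeddings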
torch.manual_seed(0)
model = nn.EmbeddingBag(len(my_vocab),5)

In [42]:
model.weight

Parameter containing:
tensor([[-1.1258, -1.1524, -0.2506, -0.4339,  0.8487],
        [ 0.6920, -0.3160, -2.1152,  0.3223, -1.2633],
        [ 0.3500,  0.3081,  0.1198,  1.2377,  1.1168],
        [-0.2473, -1.3527, -1.6959,  0.5667,  0.7935],
        [ 0.5988, -1.5551, -0.3414,  1.8530, -0.2159],
        [-0.7425,  0.5627,  0.2596, -0.1740, -0.6787],
        [ 0.9383,  0.4889,  1.2032,  0.0845, -1.2001],
        [-0.0048, -0.5181, -0.3067, -1.5810,  1.7066]], requires_grad=True)

In [43]:
for label, text, offsets in train_loader:
    output=model(text, offsets)
    print('Output')
    print(output)
    print(output.shape)
    print('='*75)

Output
tensor([[ 0.5469, -0.5210, -0.7789,  1.1376, -0.1208],
        [ 0.1819,  0.4532,  0.5276,  0.3827, -0.2540]],
       grad_fn=<EmbeddingBagBackward0>)
torch.Size([2, 5])
Output
tensor([[-0.1325,  0.1176,  0.0243, -0.1724,  0.7149],
        [ 0.2649, -0.4535, -1.2304,  0.7089,  0.2157]],
       grad_fn=<EmbeddingBagBackward0>)
torch.Size([2, 5])
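
As a cross-check, the first EmbeddingBag output above can be rebuilt by hand. This sketch assumes the printed weight matrix (rounded to four decimals) loaded through from_pretrained, and a batch holding the bags [1, 2, 4] and [5, 2, 6]; it compares EmbeddingBag with a per-bag mean of plain Embedding lookups.

```python
import torch
import torch.nn as nn

# weight matrix copied from the printed model.weight (rounded to 4 decimals)
weights = torch.tensor([
    [-1.1258, -1.1524, -0.2506, -0.4339,  0.8487],
    [ 0.6920, -0.3160, -2.1152,  0.3223, -1.2633],
    [ 0.3500,  0.3081,  0.1198,  1.2377,  1.1168],
    [-0.2473, -1.3527, -1.6959,  0.5667,  0.7935],
    [ 0.5988, -1.5551, -0.3414,  1.8530, -0.2159],
    [-0.7425,  0.5627,  0.2596, -0.1740, -0.6787],
    [ 0.9383,  0.4889,  1.2032,  0.0845, -1.2001],
    [-0.0048, -0.5181, -0.3067, -1.5810,  1.7066],
])
bag = nn.EmbeddingBag.from_pretrained(weights, mode="mean")
embedding = nn.Embedding.from_pretrained(weights)

# two bags flattened into one tensor; offsets mark where each bag starts
text = torch.tensor([1, 2, 4, 5, 2, 6])
offsets = torch.tensor([0, 3])

bag_output = bag(text, offsets)

# the same reduction done by hand: look up each bag, then average its rows
manual_output = torch.stack([
    embedding(text[0:3]).mean(dim=0),
    embedding(text[3:6]).mean(dim=0),
])
outputs_match = torch.allclose(bag_output, manual_output, atol=1e-6)
print(bag_output)
print(outputs_match)   # True
```

The rows of bag_output should agree, up to rounding, with both the first EmbeddingBag output and the sentence_embedding computed in the Embedding section for the same bags.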


---