## Embeddings

In our previous example, we worked with high-dimensional bag-of-words vectors of length `vocab_size`, explicitly converting low-dimensional positional representation vectors into sparse one-hot representations. This one-hot representation is not memory-efficient, and furthermore, each word is treated as completely independent from the others. In other words, one-hot encoded vectors do not capture any semantic similarity between words.

In this unit, we will continue exploring the **AG News** dataset. To start, let's load the data and reuse some definitions from the previous notebook.


In [1]:
import torch
import torchtext
import numpy as np
from torchnlp import *
train_dataset, test_dataset, classes, vocab = load_dataset()
vocab_size = len(vocab)
print("Vocab size = ",vocab_size)

Loading dataset...


d:\WORK\ai-for-beginners\5-NLP\14-Embeddings\data\train.csv: 29.5MB [00:01, 18.8MB/s]                            
d:\WORK\ai-for-beginners\5-NLP\14-Embeddings\data\test.csv: 1.86MB [00:00, 11.2MB/s]                          


Building vocab...
Vocab size =  95812


## What is embedding?

The concept of **embedding** involves representing words as lower-dimensional dense vectors that capture the semantic meaning of a word. We'll explore how to create meaningful word embeddings later, but for now, think of embeddings as a method to reduce the dimensionality of a word vector.

An embedding layer takes a word as input and outputs a vector of a specified `embedding_size`. In essence, it functions similarly to a `Linear` layer, but instead of requiring a one-hot encoded vector, it can directly take a word number as input.
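
The equivalence with a `Linear` layer can be checked on a toy example. The sketch below uses plain Python lists rather than PyTorch, and the weight values and helper names (`one_hot`, `linear_no_bias`, `embedding_lookup`) are made up for illustration. It shows that multiplying a one-hot vector by a weight matrix returns the same row that a direct index lookup would.

```python
# Toy embedding table: 4 words, embedding_size = 3 (values are made up).
weights = [
    [0.1, 0.2, 0.3],   # word 0
    [0.4, 0.5, 0.6],   # word 1
    [0.7, 0.8, 0.9],   # word 2
    [1.0, 1.1, 1.2],   # word 3
]

def one_hot(index, size):
    """Return a one-hot list with 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def linear_no_bias(x, w):
    """Multiply row vector x (length vocab) by matrix w (vocab x dim)."""
    dim = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(w))) for j in range(dim)]

def embedding_lookup(index, w):
    """What an embedding layer does conceptually: return row `index`."""
    return list(w[index])

word = 2
via_linear = linear_no_bias(one_hot(word, len(weights)), weights)
via_lookup = embedding_lookup(word, weights)
print(via_linear)
print(via_lookup)
```

The lookup avoids building the one-hot vector and the full matrix product, which is where the memory and compute savings come from.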

By using an embedding layer as the first layer in our network, we can transition from a bag-of-words model to an **embedding bag** model. In this approach, each word in the text is first converted into its corresponding embedding, and then an aggregate function—such as `sum`, `average`, or `max`—is applied to all those embeddings.

![Image showing an embedding classifier for five sequence words.](../../../../../translated_images/embedding-classifier-example.b77f021a7ee67eeec8e68bfe11636c5b97d6eaa067515a129bfb1d0034b1ac5b.en.png)

Our neural network classifier will begin with an embedding layer, followed by an aggregation layer, and finally a linear classifier on top:


In [2]:
class EmbedClassifier(torch.nn.Module):
    def __init__(self, vocab_size, embed_dim, num_class):
        super().__init__()
        self.embedding = torch.nn.Embedding(vocab_size, embed_dim)
        self.fc = torch.nn.Linear(embed_dim, num_class)

    def forward(self, x):
        x = self.embedding(x)
        x = torch.mean(x,dim=1)
        return self.fc(x)

### Dealing with variable sequence size

Due to this architecture, minibatches for our network need to be constructed in a specific way. In the previous section, when using bag-of-words, all BoW tensors in a minibatch had the same size `vocab_size`, regardless of the actual length of the text sequence. However, when transitioning to word embeddings, the number of words in each text sample will vary, and when combining these samples into minibatches, padding will need to be applied.

As before, we can do this by passing a `collate_fn` function to the `DataLoader`:


In [3]:
def padify(b):
    # b is the list of tuples of length batch_size
    #   - first element of a tuple = label, 
    #   - second = feature (text sequence)
    # build vectorized sequence
    v = [encode(x[1]) for x in b]
    # first, compute max length of a sequence in this minibatch
    l = max(map(len,v))
    return ( # tuple of two tensors - labels and features
        torch.LongTensor([t[0]-1 for t in b]),
        torch.stack([torch.nn.functional.pad(torch.tensor(t),(0,l-len(t)),mode='constant',value=0) for t in v])
    )

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=16, collate_fn=padify, shuffle=True)
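
To see what the padding step produces, here is a small pure-Python sketch with made-up word indices. The helper name `pad_batch` is hypothetical; it mirrors the idea in `padify` (pad every sequence with zeros up to the longest one in the minibatch) without using tensors.

```python
def pad_batch(sequences, pad_value=0):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(s) for s in sequences)
    return [s + [pad_value] * (max_len - len(s)) for s in sequences]

# Three encoded texts of different lengths (indices are made up).
batch = [[5, 8, 2], [7], [4, 9, 1, 3, 6]]
padded = pad_batch(batch)
for row in padded:
    print(row)
```

Note that in the `EmbedClassifier` above, `torch.mean(x, dim=1)` averages over the padded positions as well, so the embedding of index 0 contributes to the representation of short sequences.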

### Training embedding classifier

Now that we have set up the appropriate dataloader, we can train the model using the training function we defined in the previous section:


In [4]:
net = EmbedClassifier(vocab_size,32,len(classes)).to(device)
train_epoch(net,train_loader, lr=1, epoch_size=25000)

3200: acc=0.6415625
6400: acc=0.6865625
9600: acc=0.7103125
12800: acc=0.726953125
16000: acc=0.739375
19200: acc=0.75046875
22400: acc=0.7572321428571429


(0.889799795315499, 0.7623160588611644)

**Note**: For the sake of time, we train on only 25k records here (less than one full epoch). You can continue training, write a function to train for several epochs, and experiment with the learning rate to achieve higher accuracy. You should be able to reach an accuracy of about 90%.


### EmbeddingBag Layer and Variable-Length Sequence Representation

In the previous architecture, we had to pad all sequences to the same length to fit them into a minibatch. This is not the most efficient way to represent variable-length sequences. An alternative approach is to use an **offset** vector, which stores the starting positions of all sequences within a single large vector.

![Image showing an offset sequence representation](../../../../../translated_images/offset-sequence-representation.eb73fcefb29b46eecfbe74466077cfeb7c0f93a4f254850538a2efbc63517479.en.png)

> **Note**: In the image above, we illustrate a sequence of characters, but in our example, we are working with sequences of words. However, the general principle of representing sequences with an offset vector remains the same.
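
The offset representation can be sketched in a few lines of plain Python. The function names `to_offsets` and `from_offsets` and the word indices are illustrative only. The first function concatenates all sequences into one flat list and records where each sequence starts; the second recovers the original sequences, which shows that no information is lost and no padding is needed.

```python
def to_offsets(sequences):
    """Concatenate sequences and record the start position of each one."""
    flat, offsets = [], []
    for seq in sequences:
        offsets.append(len(flat))  # this sequence starts at the current end
        flat.extend(seq)
    return flat, offsets

def from_offsets(flat, offsets):
    """Split the flat list back into the original sequences."""
    ends = offsets[1:] + [len(flat)]
    return [flat[start:end] for start, end in zip(offsets, ends)]

batch = [[5, 8, 2], [7], [4, 9, 1, 3, 6]]
flat, offsets = to_offsets(batch)
print(flat)     # all word indices, back to back
print(offsets)  # start position of each sequence
```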

To work with the offset representation, we use the [`EmbeddingBag`](https://pytorch.org/docs/stable/generated/torch.nn.EmbeddingBag.html) layer. It is similar to `Embedding`, but it takes both a content vector and an offset vector as input. Additionally, it includes an aggregation layer, which can perform operations like `mean`, `sum`, or `max`.

Below is the modified network that uses `EmbeddingBag`:


In [5]:
class EmbedClassifier(torch.nn.Module):
    def __init__(self, vocab_size, embed_dim, num_class):
        super().__init__()
        self.embedding = torch.nn.EmbeddingBag(vocab_size, embed_dim)
        self.fc = torch.nn.Linear(embed_dim, num_class)

    def forward(self, text, off):
        x = self.embedding(text, off)
        return self.fc(x)
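
To make the aggregation step concrete, the following pure-Python sketch mimics what an `EmbeddingBag` layer in `mean` mode computes from a flat index vector and an offset vector. The function name `embedding_bag_mean` and the toy table are illustrative, and the sketch assumes every bag is non-empty. It is only meant to show the arithmetic, not how PyTorch implements the layer.

```python
def embedding_bag_mean(flat_ids, offsets, table):
    """Average the embedding rows of each bag delimited by `offsets`."""
    dim = len(table[0])
    ends = offsets[1:] + [len(flat_ids)]
    result = []
    for start, end in zip(offsets, ends):
        rows = [table[i] for i in flat_ids[start:end]]
        result.append([sum(r[j] for r in rows) / len(rows) for j in range(dim)])
    return result

# Toy embedding table with 4 words and 2 dimensions (values are made up).
table = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
flat_ids = [1, 2, 3, 1]   # two texts: [1, 2, 3] and [1]
offsets = [0, 3]
bags = embedding_bag_mean(flat_ids, offsets, table)
print(bags)
```

Each text is reduced to a single vector of size `embed_dim`, which is why the classifier can pass the result straight to the final `Linear` layer.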

To prepare the dataset for training, we need to provide a conversion function that will prepare the offset vector:


In [6]:
def offsetify(b):
    # first, compute data tensor from all sequences
    x = [torch.tensor(encode(t[1])) for t in b]
    # now, compute the offsets by accumulating the tensor of sequence lengths
    o = [0] + [len(t) for t in x]
    o = torch.tensor(o[:-1]).cumsum(dim=0)
    return ( 
        torch.LongTensor([t[0]-1 for t in b]), # labels
        torch.cat(x), # text 
        o
    )

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=16, collate_fn=offsetify, shuffle=True)

Note that, unlike in all previous examples, our network now accepts two parameters: a data vector and an offset vector, which are of different sizes. Similarly, our data loader now provides three values instead of two: the labels, plus the text and offset vectors as features. Therefore, we need to slightly adjust our training function to handle this:


In [7]:
net = EmbedClassifier(vocab_size,32,len(classes)).to(device)

def train_epoch_emb(net,dataloader,lr=0.01,optimizer=None,loss_fn = torch.nn.CrossEntropyLoss(),epoch_size=None, report_freq=200):
    optimizer = optimizer or torch.optim.Adam(net.parameters(),lr=lr)
    loss_fn = loss_fn.to(device)
    net.train()
    total_loss,acc,count,i = 0,0,0,0
    for labels,text,off in dataloader:
        optimizer.zero_grad()
        labels,text,off = labels.to(device), text.to(device), off.to(device)
        out = net(text, off)
        loss = loss_fn(out,labels) #cross_entropy(out,labels)
        loss.backward()
        optimizer.step()
        total_loss+=loss
        _,predicted = torch.max(out,1)
        acc+=(predicted==labels).sum()
        count+=len(labels)
        i+=1
        if i%report_freq==0:
            print(f"{count}: acc={acc.item()/count}")
        if epoch_size and count>epoch_size:
            break
    return total_loss.item()/count, acc.item()/count


train_epoch_emb(net,train_loader, lr=4, epoch_size=25000)

3200: acc=0.6153125
6400: acc=0.6615625
9600: acc=0.6932291666666667
12800: acc=0.715078125
16000: acc=0.7270625
19200: acc=0.7382291666666667
22400: acc=0.7486160714285715


(22.771553103007037, 0.7551983365323096)

## Semantic Embeddings: Word2Vec

In our previous example, the model's embedding layer learned to map words to vector representations, but these representations lacked significant semantic meaning. It would be ideal to learn vector representations where similar words or synonyms correspond to vectors that are close to each other based on some vector distance (e.g., Euclidean distance).

To achieve this, we need to pre-train our embedding model on a large text corpus in a specific way. One of the earliest methods for training semantic embeddings is called [Word2Vec](https://en.wikipedia.org/wiki/Word2vec). It relies on two main architectures to produce distributed representations of words:

 - **Continuous bag-of-words** (CBoW): In this architecture, the model is trained to predict a word from its surrounding context. Given the n-gram $(W_{-2},W_{-1},W_0,W_1,W_2)$, the model's goal is to predict $W_0$ from $(W_{-2},W_{-1},W_1,W_2)$.
 - **Continuous skip-gram**: This approach is the opposite of CBoW. The model uses the current word $W_0$ to predict the surrounding window of context words.

CBoW is faster, while skip-gram is slower but performs better at representing infrequent words.
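
The difference between the two architectures is easiest to see in the training examples they consume. The sketch below is a simplified illustration in plain Python, with hypothetical helper names and a window of two words on each side. It only generates (input, target) pairs and does not train anything.

```python
def cbow_examples(tokens, window=2):
    """CBoW: the input is the list of context words, the target is the center word."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            examples.append((context, target))
    return examples

def skipgram_examples(tokens, window=2):
    """Skip-gram: the input is the center word, each context word is a target."""
    return [(center, c)
            for context, center in cbow_examples(tokens, window)
            for c in context]

tokens = "the quick brown fox jumps".split()
print(cbow_examples(tokens)[2])
print([pair for pair in skipgram_examples(tokens) if pair[0] == "brown"])
```

Skip-gram produces one training pair per context word, so the same text yields more examples than CBoW, which is consistent with it being slower to train.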

![Image showing both CBoW and Skip-Gram algorithms to convert words to vectors.](../../../../../translated_images/example-algorithms-for-converting-words-to-vectors.fbe9207a726922f6f0f5de66427e8a6eda63809356114e28fb1fa5f4a83ebda7.en.png)

To experiment with Word2Vec embeddings pre-trained on the Google News dataset, we can use the **gensim** library. Below, we find the words most similar to *neural*.

> **Note:** When you first create word vectors, downloading them may take some time!


In [8]:
import gensim.downloader as api
w2v = api.load('word2vec-google-news-300')

In [9]:
for w,p in w2v.most_similar('neural'):
    print(f"{w} -> {p}")

neuronal -> 0.7804799675941467
neurons -> 0.7326500415802002
neural_circuits -> 0.7252851724624634
neuron -> 0.7174385190010071
cortical -> 0.6941086649894714
brain_circuitry -> 0.6923246383666992
synaptic -> 0.6699118614196777
neural_circuitry -> 0.6638563275337219
neurochemical -> 0.6555314064025879
neuronal_activity -> 0.6531826257705688
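
The scores printed above are cosine similarities, which is the measure gensim's `most_similar` uses to rank words. As a minimal sketch of the measure itself, using made-up 2-D vectors rather than real embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
print(same_direction, orthogonal, opposite)
```

Cosine similarity depends only on the direction of the vectors, not on their length, so two words can be close in this sense even if their vectors have different norms.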


We can also extract the embedding vector for a word, to be used when training a classification model (we show only the first 20 components of the vector for clarity):


In [10]:
w2v.get_vector('play')[:20]

array([ 0.01226807,  0.06225586,  0.10693359,  0.05810547,  0.23828125,
        0.03686523,  0.05151367, -0.20703125,  0.01989746,  0.10058594,
       -0.03759766, -0.1015625 , -0.15820312, -0.08105469, -0.0390625 ,
       -0.05053711,  0.16015625,  0.2578125 ,  0.10058594, -0.25976562],
      dtype=float32)

A great thing about semantic embeddings is that you can manipulate the vector encoding to change the semantics. For example, we can ask for a word whose vector representation is as close as possible to the words *king* and *woman*, and as far as possible from the word *man*:


In [10]:
w2v.most_similar(positive=['king','woman'],negative=['man'])[0]

('queen', 0.7118192911148071)
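
The idea behind this query can be illustrated with hand-made vectors. The sketch below uses hypothetical 2-D embeddings (not real Word2Vec values) in which one axis loosely stands for "royalty" and the other for "gender". It adds the positive vectors, subtracts the negative ones, and picks the remaining word with the highest cosine similarity. gensim's `most_similar` differs in details such as vector normalization, so treat this as an approximation of the idea rather than a reimplementation.

```python
import math

# Hypothetical embeddings: axis 0 ~ "royalty", axis 1 ~ "gender".
toy_vectors = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def analogy(vectors, positive, negative):
    """Return the word closest to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for w in positive:
        query = [q + v for q, v in zip(query, vectors[w])]
    for w in negative:
        query = [q - v for q, v in zip(query, vectors[w])]
    # Skip the query words themselves, otherwise they tend to win.
    exclude = set(positive) | set(negative)
    candidates = [w for w in vectors if w not in exclude]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

answer = analogy(toy_vectors, positive=["king", "woman"], negative=["man"])
print(answer)
```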

Both CBoW and skip-gram are "predictive" embeddings, meaning that they take only local contexts into account. Word2Vec does not make use of global context.

**FastText** builds on Word2Vec by learning vector representations for each word as well as the character n-grams within each word. These representations are then averaged into a single vector during each training step. While this adds significant computational overhead during pre-training, it allows word embeddings to capture sub-word information.
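
As a rough illustration of the sub-word units involved, the sketch below extracts character n-grams from a word after wrapping it in `<` and `>` boundary markers. The function name `char_ngrams` is hypothetical, and the 3 to 6 character range is an assumption based on commonly cited FastText defaults. The sketch only lists the n-grams; it does not learn or average any vectors.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """List the character n-grams of `word`, including boundary markers."""
    marked = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams

trigrams = char_ngrams("where", min_n=3, max_n=3)
print(trigrams)
```

Because rare or unseen words share n-grams with known words, a vector can still be assembled for them from the n-gram vectors.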

Another method, **GloVe**, uses the concept of a co-occurrence matrix and applies neural techniques to decompose the matrix into more expressive and non-linear word vectors.
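
The co-occurrence matrix that this method starts from can be sketched with a few lines of plain Python. The toy corpus, the function name `cooccurrence_counts`, and the unweighted counting are all assumptions made for illustration. Real implementations typically work on very large corpora and may weight pairs by distance, and the decomposition step itself is not shown here.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered word pair appears within `window` positions."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

corpus = [
    ["ice", "is", "cold"],
    ["steam", "is", "hot"],
    ["ice", "is", "solid"],
]
counts = cooccurrence_counts(corpus, window=1)
print(counts[("ice", "is")], counts[("steam", "is")], counts[("ice", "cold")])
```

Unlike the local windows used one at a time by Word2Vec, these counts summarize statistics over the whole corpus before any vectors are learned.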

You can experiment with the example by switching the embeddings to FastText and GloVe, as gensim supports several different word embedding models.


## Using Pre-Trained Embeddings in PyTorch

We can adapt the example above to initialize the matrix in our embedding layer with semantic embeddings, such as Word2Vec. It's important to note that the vocabularies of the pre-trained embeddings and our text corpus will probably not align perfectly, so we'll assign random values to the weights of any missing words:


In [11]:
embed_size = len(w2v.get_vector('hello'))
print(f'Embedding size: {embed_size}')

net = EmbedClassifier(vocab_size,embed_size,len(classes))

print('Populating matrix, this will take some time...',end='')
found, not_found = 0,0
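for i,w in enumerate(vocab.get_itos()):
    try:
        # write into the weight matrix itself: assigning to weight[i].data
        # would only rebind a temporary view and leave the layer unchanged
        net.embedding.weight.data[i] = torch.tensor(w2v.get_vector(w))
        found+=1
    except KeyError:
        # the word is missing from the Word2Vec vocabulary: use random values
        net.embedding.weight.data[i] = torch.normal(0.0,1.0,(embed_size,))
        not_found+=1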

print(f"Done, found {found} words, {not_found} words missing")
net = net.to(device)

Embedding size: 300
Populating matrix, this will take some time...Done, found 41080 words, 54732 words missing
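
In the run above, 41080 of the 95812 vocabulary words (about 43%) were found in the pre-trained model. The fallback logic can be sketched independently of PyTorch and gensim. In the example below, the helper `build_matrix`, the tiny vocabulary, and the 2-D "pre-trained" vectors are all made up for illustration.

```python
import random

def build_matrix(vocab_words, pretrained, dim, seed=0):
    """Copy pre-trained vectors where available, random values otherwise."""
    rng = random.Random(seed)
    matrix, missing = [], []
    for w in vocab_words:
        if w in pretrained:
            matrix.append(list(pretrained[w]))
        else:
            # Missing word: draw each component from a standard normal.
            matrix.append([rng.gauss(0.0, 1.0) for _ in range(dim)])
            missing.append(w)
    return matrix, missing

pretrained = {"cat": [0.1, 0.2], "dog": [0.3, 0.4]}
vocab_words = ["cat", "zebra", "dog"]
matrix, missing = build_matrix(vocab_words, pretrained, dim=2)
coverage = 1 - len(missing) / len(vocab_words)
print(missing, round(coverage, 2))
```

Reporting the list of missing words, and not only their count, can help decide whether the gaps are mostly rare tokens or whether they stem from a tokenization or casing mismatch.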


Now let's train our model. Note that training takes significantly longer than in the previous example, because the larger embedding size means a much higher number of parameters. For the same reason, we may need to train our model on more examples if we want to avoid overfitting.


In [12]:
train_epoch_emb(net,train_loader, lr=4, epoch_size=25000)

3200: acc=0.6359375
6400: acc=0.68109375
9600: acc=0.7067708333333333
12800: acc=0.723671875
16000: acc=0.73625
19200: acc=0.7463541666666667
22400: acc=0.7560714285714286


(214.1013875559821, 0.7626759436980166)

In our case, we do not observe a significant improvement in accuracy, which is likely due to the differences in vocabularies.  
To address the issue of differing vocabularies, we can consider one of the following solutions:  
* Re-train the word2vec model using our own vocabulary  
* Load our dataset using the vocabulary from the pre-trained word2vec model. The vocabulary used for loading the dataset can be specified during the loading process.  

The second approach seems simpler, especially since PyTorch's `torchtext` framework provides built-in support for embeddings. For instance, we can create a GloVe-based vocabulary in the following way:  


In [14]:
vocab = torchtext.vocab.GloVe(name='6B', dim=50)

100%|█████████▉| 399999/400000 [00:15<00:00, 25411.14it/s]


The loaded vocabulary supports the following basic operations:
* The `vocab.stoi` dictionary lets us convert a word into its dictionary index.
* `vocab.itos` does the reverse—it converts a number back into a word.
* `vocab.vectors` is the array of embedding vectors, so to get the embedding of a word `s`, we use `vocab.vectors[vocab.stoi[s]]`.

Here’s an example of manipulating embeddings to demonstrate the equation **king-man+woman = queen** (I had to adjust the coefficient slightly to make it work):


In [15]:
# get the vector corresponding to king-man+woman (woman is weighted by 1.3)
qvec = vocab.vectors[vocab.stoi['king']]-vocab.vectors[vocab.stoi['man']]+1.3*vocab.vectors[vocab.stoi['woman']]
# find the index of the closest embedding vector 
d = torch.sum((vocab.vectors-qvec)**2,dim=1)
min_idx = torch.argmin(d)
# find the corresponding word
vocab.itos[min_idx]

'queen'

To train the classifier using these embeddings, we first need to encode our dataset using the GloVe vocabulary:


In [16]:
def offsetify(b):
    # first, compute data tensor from all sequences
    x = [torch.tensor(encode(t[1],voc=vocab)) for t in b] # pass the instance of vocab to encode function!
    # now, compute the offsets by accumulating the tensor of sequence lengths
    o = [0] + [len(t) for t in x]
    o = torch.tensor(o[:-1]).cumsum(dim=0)
    return ( 
        torch.LongTensor([t[0]-1 for t in b]), # labels
        torch.cat(x), # text 
        o
    )

As we have seen above, all vector embeddings are stored in the `vocab.vectors` matrix. This makes it very easy to load them into the weights of the embedding layer by simple copying:


In [17]:
net = EmbedClassifier(len(vocab),len(vocab.vectors[0]),len(classes))
net.embedding.weight.data = vocab.vectors
net = net.to(device)

Now let's train our model and see if we get better results:


In [18]:
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=16, collate_fn=offsetify, shuffle=True)
train_epoch_emb(net,train_loader, lr=4, epoch_size=25000)

3200: acc=0.6271875
6400: acc=0.68078125
9600: acc=0.7030208333333333
12800: acc=0.71984375
16000: acc=0.7346875
19200: acc=0.7455729166666667
22400: acc=0.7529464285714286


(35.53972978646833, 0.7575175943698017)

One of the reasons we do not see a significant increase in accuracy is that some words from our dataset are missing from the pre-trained GloVe vocabulary, so they are essentially ignored. To overcome this, we can train our own embeddings on our dataset.


## Contextual Embeddings

One major limitation of traditional pretrained embedding representations like Word2Vec is the issue of word sense disambiguation. While pretrained embeddings can capture some of the meaning of words in context, all possible meanings of a word are encoded into the same embedding. This can create challenges for downstream models, as many words, such as the word 'play,' have different meanings depending on the context in which they are used.

For example, the word 'play' has quite different meanings in these two sentences:
- I went to a **play** at the theater.
- John wants to **play** with his friends.

The pretrained embeddings mentioned above represent both meanings of the word 'play' in the same embedding. To address this limitation, we need to create embeddings based on a **language model**, which is trained on a large corpus of text and *understands* how words can be used in different contexts. While discussing contextual embeddings is beyond the scope of this tutorial, we will revisit them when we talk about language models in the next unit.



