In [None]:
%matplotlib inline

Source: [https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html#exercise-computing-word-embeddings-continuous-bag-of-words](https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html#exercise-computing-word-embeddings-continuous-bag-of-words)

# Word Embeddings: Encoding Lexical Semantics

Word embeddings are dense vectors of real numbers, one per word in your
vocabulary. In NLP, it is almost always the case that your features are
words! But how should you represent a word in a computer? You could
store its ASCII character representation, but that only tells you what
the word *is*, it doesn't say much about what it *means* (you might be
able to derive its part of speech from its affixes, or properties from
its capitalization, but not much). Moreover, in what sense could you
combine these representations? We often want dense outputs from our
neural networks: the inputs are $|V|$-dimensional, where
$V$ is our vocabulary, but the outputs often have only a few
dimensions (if we are only predicting a handful of labels, for
instance). How do we get from a very high-dimensional space to a
lower-dimensional one?

How about instead of ASCII representations, we use a one-hot encoding?
That is, we represent the word $w$ by

\begin{align}\overbrace{\left[ 0, 0, \dots, 1, \dots, 0, 0 \right]}^\text{|V| elements}\end{align}

where the 1 is in a location unique to $w$. Any other word will
have a 1 in some other location, and a 0 everywhere else.
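
As a minimal sketch of this encoding in plain Python: the four-word vocabulary and the helper names `one_hot` and `dot` are made up for illustration and are not part of the tutorial code.

```python
# Toy vocabulary; the words and their order are illustrative assumptions.
vocab = ["hello", "world", "physicist", "mathematician"]
word_to_ix = {word: i for i, word in enumerate(vocab)}


def one_hot(word, word_to_ix):
    """Return a |V|-element list with a single 1 at the word's index."""
    vec = [0] * len(word_to_ix)
    vec[word_to_ix[word]] = 1
    return vec


def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


physicist = one_hot("physicist", word_to_ix)
mathematician = one_hot("mathematician", word_to_ix)

print(physicist)                      # [0, 0, 1, 0]
print(dot(physicist, mathematician))  # 0
```

Two distinct words never share a nonzero position, so their dot product is always 0, no matter how related the words are.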

There is an enormous drawback to this representation, besides just how
huge it is. It basically treats all words as independent entities with
no relation to each other. What we really want is some notion of
*similarity* between words. Why? Let's see an example.

Suppose we are building a language model. Suppose we have seen the
sentences

* The mathematician ran to the store.
* The physicist ran to the store.
* The mathematician solved the open problem.

in our training data. Now suppose we get a new sentence never before
seen in our training data:

* The physicist solved the open problem.

Our language model might do OK on this sentence, but wouldn't it be much
better if we could use the following two facts:

* We have seen mathematician and physicist in the same role in a sentence. Somehow they
  have a semantic relation.
* We have seen mathematician in the same role in this new unseen sentence
  as we are now seeing physicist.

and then infer that physicist is actually a good fit in the new unseen
sentence? This is what we mean by a notion of similarity: we mean
*semantic similarity*, not simply having similar orthographic
representations. It is a technique to combat the sparsity of linguistic
data, by connecting the dots between what we have seen and what we
haven't. This example of course relies on a fundamental linguistic
assumption: that words appearing in similar contexts are related to each
other semantically. This is called the
[distributional hypothesis](https://en.wikipedia.org/wiki/Distributional_semantics).
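
To make "appearing in similar contexts" concrete, here is a rough sketch that records, for each word in the three training sentences, the (previous word, next word) pairs it occurs between. Lowercasing, whitespace splitting, and the `<s>`/`</s>` boundary markers are simplifying assumptions, and a one-word window is only one of many ways to define a context.

```python
from collections import defaultdict

sentences = [
    "the mathematician ran to the store",
    "the physicist ran to the store",
    "the mathematician solved the open problem",
]

# Map each word to the set of (previous word, next word) contexts it appears in.
contexts = defaultdict(set)
for sentence in sentences:
    words = ["<s>"] + sentence.split() + ["</s>"]  # hypothetical boundary markers
    for prev, word, nxt in zip(words, words[1:], words[2:]):
        contexts[word].add((prev, nxt))

shared = contexts["mathematician"] & contexts["physicist"]
print(shared)  # {('the', 'ran')}
```

The two words share a context, which is the kind of evidence the distributional hypothesis treats as a hint of semantic relatedness.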


# Getting Dense Word Embeddings

How can we solve this problem? That is, how could we actually encode
semantic similarity in words? Maybe we think up some semantic
attributes. For example, we see that both mathematicians and physicists
can run, so maybe we give these words a high score for the "is able to
run" semantic attribute. Think of some other attributes, and imagine
what you might score some common words on those attributes.

If each attribute is a dimension, then we might give each word a vector,
like this:

\begin{align}q_\text{mathematician} = \left[ \overbrace{2.3}^\text{can run},
   \overbrace{9.4}^\text{likes coffee}, \overbrace{-5.5}^\text{majored in Physics}, \dots \right]\end{align}

\begin{align}q_\text{physicist} = \left[ \overbrace{2.5}^\text{can run},
   \overbrace{9.1}^\text{likes coffee}, \overbrace{6.4}^\text{majored in Physics}, \dots \right]\end{align}

Then we can get a measure of similarity between these words by doing:

\begin{align}\text{Similarity}(\text{physicist}, \text{mathematician}) = q_\text{physicist} \cdot q_\text{mathematician}\end{align}

Although it is more common to normalize by the lengths:

\begin{align}\text{Similarity}(\text{physicist}, \text{mathematician}) = \frac{q_\text{physicist} \cdot q_\text{mathematician}}
   {\| q_\text{physicist} \| \| q_\text{mathematician} \|} = \cos (\phi)\end{align}

where $\phi$ is the angle between the two vectors. That way,
extremely similar words (words whose embeddings point in the same
direction) will have similarity 1. Extremely dissimilar words should
have similarity -1.
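
Both similarity measures can be checked numerically. The sketch below uses only the three attribute scores written out above and leaves out the elided "..." entries, so the numbers are illustrative rather than meaningful. The helper names are hypothetical.

```python
import math

# Only the three scores shown above (can run, likes coffee, majored in Physics).
q_mathematician = [2.3, 9.4, -5.5]
q_physicist = [2.5, 9.1, 6.4]


def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Dot product normalized by the lengths of both vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


raw = dot(q_physicist, q_mathematician)
cos = cosine_similarity(q_physicist, q_mathematician)
print(f"dot product:       {raw:.2f}")
print(f"cosine similarity: {cos:.3f}")
```

The cosine value always lies in $[-1, 1]$, whereas the raw dot product grows with the lengths of the vectors, which is one reason the normalized form is usually preferred.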


You can think of the sparse one-hot vectors from the beginning of this
section as a special case of these new vectors we have defined, where
each pair of distinct words has similarity 0, and we gave each word some unique
semantic attribute. These new vectors are *dense*, which is to say their
entries are (typically) non-zero.

But these new vectors are a big pain: you could think of thousands of
different semantic attributes that might be relevant to determining
similarity, and how on earth would you set the values of the different
attributes? Central to the idea of deep learning is that the neural
network learns representations of the features, rather than requiring
the programmer to design them herself. So why not just let the word
embeddings be parameters in our model, and then be updated during
training? This is exactly what we will do. We will have some *latent
semantic attributes* that the network can, in principle, learn. Note
that the word embeddings will probably not be interpretable. That is,
although with our hand-crafted vectors above we can see that
mathematicians and physicists are similar in that they both like coffee,
if we allow a neural network to learn the embeddings and see that both
mathematicians and physicists have a large value in the second
dimension, it is not clear what that means. They are similar in some
latent semantic dimension, but this probably has no interpretation to
us.


In summary, **word embeddings are a representation of the *semantics* of
a word, efficiently encoding semantic information that might be relevant
to the task at hand**. You can embed other things too: part of speech
tags, parse trees, anything! The idea of feature embeddings is central
to the field.


# Word Embeddings in PyTorch

Before we get to a worked example and an exercise, a few quick notes
about how to use embeddings in PyTorch and in deep learning programming
in general. Similar to how we defined a unique index for each word when
making one-hot vectors, we also need to define an index for each word
when using embeddings. These will be keys into a lookup table. That is,
embeddings are stored as a $|V| \times D$ matrix, where $D$
is the dimensionality of the embeddings, such that the word assigned
index $i$ has its embedding stored in the $i$'th row of the
matrix. In all of my code, the mapping from words to indices is a
dictionary named word\_to\_ix.

The module that allows you to use embeddings is torch.nn.Embedding,
which takes two arguments: the vocabulary size, and the dimensionality
of the embeddings.

To index into this table, you must use torch.LongTensor (since the
indices are integers, not floats).
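
The lookup-table view can be verified directly. This is a small sketch with arbitrary sizes (4 words, 3 dimensions) chosen for illustration. It checks that looking up index $i$ returns row $i$ of the weight matrix, and that this matches multiplying a one-hot row vector by that matrix.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, embedding_dim = 4, 3  # illustrative sizes, not from the tutorial
embeds = nn.Embedding(vocab_size, embedding_dim)

# The lookup table is a |V| x D matrix stored in `weight`.
print(embeds.weight.shape)  # torch.Size([4, 3])

i = 2
lookup = torch.tensor([i], dtype=torch.long)
row = embeds(lookup)  # shape (1, D)

# Looking up index i returns the i'th row of the matrix...
assert torch.equal(row[0], embeds.weight[i])

# ...which is the same as multiplying a one-hot row vector by the matrix.
one_hot = torch.zeros(1, vocab_size)
one_hot[0, i] = 1.0
assert torch.allclose(one_hot @ embeds.weight, row)
```

The embedding layer simply skips the multiplication by all those zeros and indexes the row directly.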




In [None]:
# Author: Robert Guthrie

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

<torch._C.Generator at 0x7e386c65bd70>

In [None]:
word_to_ix = {"hello": 0, "world": 1}
embeds = nn.Embedding(2, 5)  # 2 words in vocab, 5 dimensional embeddings
lookup_tensor = torch.tensor([word_to_ix["hello"]], dtype=torch.long)
hello_embed = embeds(lookup_tensor)
print(hello_embed)

tensor([[ 0.6614,  0.2669,  0.0617,  0.6213, -0.4519]],
       grad_fn=<EmbeddingBackward0>)


# An Example: N-Gram Language Modeling

Recall that in an n-gram language model, given a sequence of words
$w$, we want to compute

\begin{align}P(w_i | w_{i-1}, w_{i-2}, \dots, w_{i-n+1} )\end{align}

where $w_i$ is the $i$-th word of the sequence.

In this example, we will compute the loss function on some training
examples and update the parameters with backpropagation.
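
Before the neural version, it may help to see what this conditional probability looks like when estimated purely from counts, for $n = 3$. This is a count-based baseline sketched for illustration, not the model trained in this example. The toy corpus reuses the three sentences from the introduction, and the function name `trigram_prob` is hypothetical.

```python
from collections import Counter

# Toy corpus; lowercasing and whitespace tokenization are simplifying assumptions.
tokens = (
    "the mathematician ran to the store . "
    "the physicist ran to the store . "
    "the mathematician solved the open problem ."
).split()

# Count every (w_{i-2}, w_{i-1}, w_i) triple.
trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))

# Count how often each two-word context is followed by any word.
context_counts = Counter()
for (w1, w2, _w3), count in trigram_counts.items():
    context_counts[(w1, w2)] += count


def trigram_prob(w1, w2, w3):
    """Maximum-likelihood estimate of P(w3 | w1, w2); 0.0 for unseen contexts."""
    if context_counts[(w1, w2)] == 0:
        return 0.0
    return trigram_counts[(w1, w2, w3)] / context_counts[(w1, w2)]


print(trigram_prob("the", "mathematician", "ran"))  # 0.5
print(trigram_prob("the", "physicist", "solved"))   # 0.0
```

The second probability is zero simply because that exact trigram never occurred, which is the sparsity problem that learned embeddings are meant to soften.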




In [None]:
CONTEXT_SIZE = 2
EMBEDDING_DIM = 10
# We will use Shakespeare Sonnet 2
test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()
# we should tokenize the input, but we will ignore that for now
# build a list of tuples.  Each tuple is ([ word_i-2, word_i-1 ], target word)
trigrams = [([test_sentence[i], test_sentence[i + 1]], test_sentence[i + 2])
            for i in range(len(test_sentence) - 2)]
# print the first 3, just so you can see what they look like
print(trigrams[:3])

vocab = set(test_sentence)
word_to_ix = {word: i for i, word in enumerate(vocab)}


class NGramLanguageModeler(nn.Module):

    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs


losses = []
loss_function = nn.NLLLoss()
model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
optimizer = optim.SGD(model.parameters(), lr=0.001)

for epoch in range(10):
    total_loss = 0
    for context, target in trigrams:

        # Step 1. Prepare the inputs to be passed to the model (i.e, turn the words
        # into integer indices and wrap them in tensors)
        context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)

        # Step 2. Recall that torch *accumulates* gradients. Before passing in a
        # new instance, you need to zero out the gradients from the old
        # instance
        model.zero_grad()

        # Step 3. Run the forward pass, getting log probabilities over next
        # words
        log_probs = model(context_idxs)

        # Step 4. Compute your loss function. (Again, Torch wants the target
        # word wrapped in a tensor)
        loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))

        # Step 5. Do the backward pass and update the parameters
        loss.backward()
        optimizer.step()

        # Get the Python number from a 1-element Tensor by calling tensor.item()
        total_loss += loss.item()
    losses.append(total_loss)
print(losses)  # The loss decreased every iteration over the training data!

[(['When', 'forty'], 'winters'), (['forty', 'winters'], 'shall'), (['winters', 'shall'], 'besiege')]
[520.4616117477417, 517.8069741725922, 515.1721034049988, 512.5561964511871, 509.95678639411926, 507.3753750324249, 504.81008219718933, 502.2583179473877, 499.72035217285156, 497.1941342353821]


# Exercise: Computing Word Embeddings: Continuous Bag-of-Words

The Continuous Bag-of-Words model (CBOW) is frequently used in NLP deep
learning. It is a model that tries to predict words given the context of
a few words before and a few words after the target word. This is
distinct from language modeling, since CBOW is not sequential and does
not have to be probabilistic. Typically, CBOW is used to quickly train
word embeddings, and these embeddings are used to initialize the
embeddings of some more complicated model. Usually, this is referred to
as *pretraining embeddings*. It almost always improves performance by a couple
of percent.

The CBOW model is as follows. Given a target word $w_i$ and a
context window of size $N$ on each side, $w_{i-1}, \dots, w_{i-N}$
and $w_{i+1}, \dots, w_{i+N}$, referring to all context words
collectively as $C$, CBOW tries to minimize

\begin{align}-\log p(w_i | C) = -\log \text{Softmax}(A(\sum_{w \in C} q_w) + b)\end{align}

where $q_w$ is the embedding for word $w$.
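
To see the objective as arithmetic, here is a plain-Python sketch that follows the formula literally: sum the context embeddings, apply a single affine map, and take the negative log-softmax at the target index. The sizes and the random values of `q`, `A`, and `b` are placeholders for illustration. The `CBOW` class in this notebook additionally inserts a hidden ReLU layer, which this sketch leaves out.

```python
import math
import random

random.seed(0)
vocab_size, dim = 5, 3  # illustrative sizes

# Hypothetical parameters: embeddings q (|V| x D), matrix A (|V| x D), bias b (|V|).
q = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab_size)]
A = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab_size)]
b = [0.0] * vocab_size


def cbow_neg_log_prob(context_ixs, target_ix):
    """Compute -log p(target | context) for the single-layer formula above."""
    # Sum of the context embeddings: sum_{w in C} q_w
    summed = [sum(q[w][d] for w in context_ixs) for d in range(dim)]
    # One score per vocabulary word: A @ summed + b
    scores = [
        sum(A[v][d] * summed[d] for d in range(dim)) + b[v]
        for v in range(vocab_size)
    ]
    # Log of the softmax normalizer, shifted by the max score for stability
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return -(scores[target_ix] - log_z)


loss = cbow_neg_log_prob(context_ixs=[0, 1, 3, 4], target_ix=2)
print(loss)
```

Exponentiating the negated loss for every possible target gives probabilities that sum to 1, which is a quick sanity check on any implementation.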

Implement this model in PyTorch by filling in the class below. Some
tips:

* Think about which parameters you need to define.
* Make sure you know what shape each operation expects. Use .view() if you need to
  reshape.
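
As a small illustration of the shape tip, the sketch below tracks tensor shapes for a single unbatched example. The vocabulary size, embedding size, and index values are arbitrary assumptions.

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(10, 4)       # illustrative: 10 words, 4 dimensions
context = torch.tensor([1, 2, 4, 5])  # 2N = 4 context word indices

vectors = embedding(context)          # (4, 4): one row per context word
summed = vectors.sum(dim=0)           # (4,): sum over the context words
batch_of_one = summed.view(1, -1)     # (1, 4): a batch containing one example

print(vectors.shape, summed.shape, batch_of_one.shape)
```

Reshaping to `(1, D)` gives the later layers a `(batch, features)` layout, so that `log_softmax(..., dim=1)` and `NLLLoss` see one row of class scores per example.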




In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import pandas as pd
import re
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
import string

In [None]:
# First finish the exercise as follows:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

CONTEXT_SIZE = 2  # 2 words to the left, 2 to the right
raw_text = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells.""".split()

# By deriving a set from `raw_text`, we deduplicate the array
vocab = set(raw_text)
vocab_size = len(vocab)

word_to_ix = {word: i for i, word in enumerate(vocab)}
data = []
for i in range(2, len(raw_text) - 2):
    context = [raw_text[i - 2], raw_text[i - 1],
               raw_text[i + 1], raw_text[i + 2]]
    target = raw_text[i]
    data.append((context, target))
print(data[:5])


class CBOW(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(CBOW, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        embeds = torch.sum(self.embeddings(inputs), dim=0).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs

# create your model and train.  here are some functions to help you make
# the data ready for use by your module

def make_context_vector(context, word_to_ix):
    idxs = [word_to_ix[w] for w in context]
    return torch.tensor(idxs, dtype=torch.long)


make_context_vector(data[0][0], word_to_ix)  # example

# Training the model
losses = []
loss_function = nn.NLLLoss()
model = CBOW(vocab_size, embedding_dim=50).to(device)
optimizer = optim.SGD(model.parameters(), lr=0.001)

for epoch in range(12):
    total_loss = 0.0
    for context, target in data:
        # Convert the words into integer indices and wrap them in tensors
        context_idxs = make_context_vector(context, word_to_ix).to(device)
        target_idxs = torch.tensor([word_to_ix[target]], dtype=torch.long).to(device)

        model.zero_grad()

        # Run the forward pass
        log_probs = model(context_idxs)
        # Compute loss function
        loss = loss_function(log_probs, target_idxs)

        # Do the backward pass and update the parameters
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f'Epoch {epoch+1}, Loss: {total_loss}')
    losses.append(total_loss)

[(['We', 'are', 'to', 'study'], 'about'), (['are', 'about', 'study', 'the'], 'to'), (['about', 'to', 'the', 'idea'], 'study'), (['to', 'study', 'idea', 'of'], 'the'), (['study', 'the', 'of', 'a'], 'idea')]
Epoch 1, Loss: 231.3255910873413
Epoch 2, Loss: 225.67047357559204
Epoch 3, Loss: 220.1729793548584
Epoch 4, Loss: 214.811208486557
Epoch 5, Loss: 209.58325743675232
Epoch 6, Loss: 204.4802279472351
Epoch 7, Loss: 199.4764038324356
Epoch 8, Loss: 194.55326175689697
Epoch 9, Loss: 189.7092010974884
Epoch 10, Loss: 184.94836127758026
Epoch 11, Loss: 180.2514146566391
Epoch 12, Loss: 175.6242400407791


**Part 1 - Train your CBOW embeddings for both datasets**

In [None]:
# Download the dataset
!gdown 1foE1JuZJeu5E_4qVge9kExzhvF32teuF # tripadvisor_hotel_reviews_reduced.csv
!gdown 13IWXrTjGTrfCd9l7dScZVO8ZvMicPU75 # scifi_reduced.txt

Downloading...
From: https://drive.google.com/uc?id=1foE1JuZJeu5E_4qVge9kExzhvF32teuF
To: /content/tripadvisor_hotel_reviews_reduced.csv
100% 7.36M/7.36M [00:00<00:00, 93.4MB/s]
Downloading...
From: https://drive.google.com/uc?id=13IWXrTjGTrfCd9l7dScZVO8ZvMicPU75
To: /content/scifi_reduced.txt
100% 43.1M/43.1M [00:00<00:00, 151MB/s]


In [None]:
# Configuration
class Config:
    BATCH_SIZE = 64
    EMBEDDING_DIM = 50
    LEARNING_RATE = 0.01
    EPOCHS = {'model_1': 12, 'model_2': 12, 'model_3': 3}


# Preprocessing and Dataset Preparation
nltk.download('punkt')
nltk.download('stopwords')

def preprocess_text(text, remove_stopwords=True):
    # Tokenize, lowercase, and keep alphabetic tokens only (drops punctuation and numbers)
    words = word_tokenize(text)
    words = [word.lower() for word in words if word.isalpha()]
    if remove_stopwords:
        # Build the stopword set once instead of calling stopwords.words() per token
        stop_words = set(stopwords.words('english'))
        words = [word for word in words if word not in stop_words]
    return ' '.join(words)


def process_chunk(chunk, buffer=''):
    """
    Process a chunk of text, returning the leftover text to be processed.
    """
    text = buffer + chunk
    sentences = sent_tokenize(text)

    # The last sentence might be incomplete; keep it in the buffer
    buffer = sentences.pop() if sentences and not text.endswith(sentences[-1]) else ''

    processed_sentences = [preprocess_text(sentence) for sentence in sentences]

    return processed_sentences, buffer


def prepare_dataset_and_vocab_hotel(filename, preprocess_func, context_size):
    # Create a vocabulary
    vocab = set()
    context_target_pairs = []
    all_tokens = []

    df = pd.read_csv(filename)
    for text in df['Review']:
        processed_text = preprocess_func(text)
        words = processed_text.split()
        all_tokens.extend(words)
        vocab.update(words)

        for i in range(context_size, len(words) - context_size):
            context = words[i - context_size:i] + words[i + 1:i + context_size + 1]
            target = words[i]
            context_target_pairs.append((context, target))

    word_to_ix = {word: i for i, word in enumerate(vocab)}
    vocab_size = len(vocab)

    return vocab_size, word_to_ix, context_target_pairs, all_tokens

def prepare_dataset_and_vocab_scifi(filename, context_size, chunk_size=1024):
    """
    Process a large text file without loading it entirely into memory.
    """
    # Create a vocabulary
    vocab = set()
    context_target_pairs = []
    all_tokens = []

    buffer = ''
    with open(filename, 'r') as file:
        while True:
            chunk = file.read(chunk_size)
            if not chunk:
                break
            processed_sentences, buffer = process_chunk(chunk, buffer)

            for sentence in processed_sentences:
                words = sentence.split()
                all_tokens.extend(words)
                vocab.update(words)

                # Create context-target pairs
                for i in range(context_size, len(words) - context_size):
                    context = words[i - context_size:i] + words[i + 1:i + context_size + 1]
                    target = words[i]
                    context_target_pairs.append((context, target))

    # Process any remaining text in the buffer
    if buffer:
        processed_sentences, _ = process_chunk('', buffer)
        for sentence in processed_sentences:
            words = sentence.split()
            all_tokens.extend(words)
            vocab.update(words)

            # Create context-target pairs
            for i in range(context_size, len(words) - context_size):
                context = words[i - context_size:i] + words[i + 1:i + context_size + 1]
                target = words[i]
                context_target_pairs.append((context, target))

    word_to_ix = {word: i for i, word in enumerate(vocab)}
    vocab_size = len(vocab)

    return vocab_size, word_to_ix, context_target_pairs, all_tokens


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
# Define the CBOW model
class CBOW(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(CBOW, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        embeds = torch.sum(self.embeddings(inputs), dim=1)  # batch processing
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs


# Define the dataset Class
class ContextTargetDataset(Dataset):
    def __init__(self, context_target_pairs, word_to_ix):
        self.context_target_pairs = context_target_pairs
        self.word_to_ix = word_to_ix

    def __len__(self):
        return len(self.context_target_pairs)

    def __getitem__(self, idx):
        context, target = self.context_target_pairs[idx]
        context_idxs = [self.word_to_ix[w] for w in context]
        target_idx = self.word_to_ix[target]
        return torch.tensor(context_idxs, dtype=torch.long), torch.tensor([target_idx], dtype=torch.long)

In [None]:
# Define training Function
def train_model(vocab_size, word_to_ix, context_target_pairs, model_name, device):
    dataset = ContextTargetDataset(context_target_pairs, word_to_ix)
    dataloader = DataLoader(dataset, batch_size=Config.BATCH_SIZE, shuffle=True)

    model = CBOW(vocab_size, Config.EMBEDDING_DIM).to(device)
    losses = []
    loss_function = nn.NLLLoss()
    optimizer = optim.SGD(model.parameters(), lr=Config.LEARNING_RATE)

    for epoch in range(Config.EPOCHS[model_name]):
        total_loss = 0.0
        for context, target in dataloader:
            # Convert the words into integer indices and wrap them in tensors
            context, target = context.to(device), target.to(device)

            model.zero_grad()

            # Run the forward pass
            log_probs = model(context)
            # Compute loss function
            loss = loss_function(log_probs, target.squeeze(1))

            # Do the backward pass and update the parameters
            loss.backward()
            optimizer.step()
            total_loss += loss.item()

        epoch_loss = total_loss / len(dataloader)
        print(f'Epoch {epoch + 1}, Average Loss per Batch: {epoch_loss}')
        losses.append(total_loss)

    # Save the model's state_dict
    torch.save(model.state_dict(), f"{model_name}.pth")

In [None]:
# GPU device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [None]:
# Model 1: CBOW2 with a context width of 2 (in both directions) for the Hotel Reviews dataset
vocab_size_1, word_to_ix_1, pairs_1, all_tokens_hotel = prepare_dataset_and_vocab_hotel('tripadvisor_hotel_reviews_reduced.csv', preprocess_text, 2)
train_model(vocab_size_1, word_to_ix_1, pairs_1, 'model_1', device)

Epoch 1, Average Loss per Batch: 8.64738132212795
Epoch 2, Average Loss per Batch: 7.839048518669808
Epoch 3, Average Loss per Batch: 7.666474648754475
Epoch 4, Average Loss per Batch: 7.57498645341329
Epoch 5, Average Loss per Batch: 7.513791409353752
Epoch 6, Average Loss per Batch: 7.4675025173163005
Epoch 7, Average Loss per Batch: 7.429744796732725
Epoch 8, Average Loss per Batch: 7.397245557732501
Epoch 9, Average Loss per Batch: 7.368372903435917
Epoch 10, Average Loss per Batch: 7.34189756465497
Epoch 11, Average Loss per Batch: 7.317531581854416
Epoch 12, Average Loss per Batch: 7.29453575607747


In [None]:
# Model 2: CBOW5 with a context width of 5 (in both directions) for the Hotel Reviews dataset
vocab_size_2, word_to_ix_2, pairs_2, _ = prepare_dataset_and_vocab_hotel('tripadvisor_hotel_reviews_reduced.csv', preprocess_text, 5)
train_model(vocab_size_2, word_to_ix_2, pairs_2, 'model_2', device)

Epoch 1, Average Loss per Batch: 8.569523979854289
Epoch 2, Average Loss per Batch: 7.904581894052738
Epoch 3, Average Loss per Batch: 7.7478905095551385
Epoch 4, Average Loss per Batch: 7.662141035501583
Epoch 5, Average Loss per Batch: 7.604164073121608
Epoch 6, Average Loss per Batch: 7.560179713850058
Epoch 7, Average Loss per Batch: 7.52452746294436
Epoch 8, Average Loss per Batch: 7.493663848424601
Epoch 9, Average Loss per Batch: 7.466604232535658
Epoch 10, Average Loss per Batch: 7.4417784889626235
Epoch 11, Average Loss per Batch: 7.418708803757753
Epoch 12, Average Loss per Batch: 7.397085127108921


In [None]:
# Model 3: CBOW2 with a context width of 2 (in both directions) for the Sci-Fi story dataset
vocab_size_3, word_to_ix_3, pairs_3, all_tokens_scifi = prepare_dataset_and_vocab_scifi('scifi_reduced.txt', 2)
train_model(vocab_size_3, word_to_ix_3, pairs_3, 'model_3', device)

Epoch 1, Average Loss per Batch: 10.16968716600376
Epoch 2, Average Loss per Batch: 9.25610216266019
Epoch 3, Average Loss per Batch: 9.040848910725176


Question: Are predictions made by the model sensitive to the context size?
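
One concrete way the context size matters, before any training happens, is in how many examples a text can produce. The sketch below repeats the pair-building loop from `prepare_dataset_and_vocab_hotel` as a standalone helper. The helper name and the eight-token review are made up for illustration.

```python
def context_target_pairs(words, context_size):
    """Pair each word with `context_size` neighbors on each side."""
    pairs = []
    for i in range(context_size, len(words) - context_size):
        context = words[i - context_size:i] + words[i + 1:i + context_size + 1]
        pairs.append((context, words[i]))
    return pairs


# Hypothetical preprocessed review, 8 tokens long.
words = "great hotel money booked hotel advisor pleasantly surprised".split()

for n in (2, 5):
    print(n, len(context_target_pairs(words, n)))
# 2 4
# 5 0
```

A window of size $N$ needs at least $2N + 1$ tokens, so short reviews contribute nothing to the CBOW5 training set, and the two models see different data as well as different inputs.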

In [None]:
# Load the trained models
def load_model(model_path, vocab_size, embedding_dim, device):
    model = CBOW(vocab_size, embedding_dim)
    model.load_state_dict(torch.load(model_path, map_location=device))
    model.to(device)
    model.eval()  # Set the model to evaluation mode
    return model

# Load CBOW2 and CBOW5 models
model_1 = load_model('model_1.pth', vocab_size_1, Config.EMBEDDING_DIM, device)
model_2 = load_model('model_2.pth', vocab_size_2, Config.EMBEDDING_DIM, device)

# Prediction function
def predict(context, model, word_to_ix, device):
    context_idxs = [word_to_ix.get(w) for w in context]  # Get indices for known words
    context_idxs = [idx for idx in context_idxs if idx is not None]  # Filter out None values (unknown words)

    if not context_idxs:  # Check if context is empty after filtering
        return None  # or handle this case as you see fit

    context_tensor = torch.tensor(context_idxs, dtype=torch.long).unsqueeze(0).to(device)
    with torch.no_grad():
        log_probs = model(context_tensor)
    return log_probs.exp().argmax(1).item()

# Creating reverse dictionary
ix_to_word_1 = {ix: word for word, ix in word_to_ix_1.items()}
ix_to_word_2 = {ix: word for word, ix in word_to_ix_2.items()}

# Example usage
context_words_CBOW2 = [['booked', 'hotel', 'advisor', 'pleasantly'], # trip
                       ['break', 'barcelona', 'hotel', 'modern']] # universal
context_words_CBOW5 = [['great', 'hotel', 'money', 'booked', 'hotel', 'advisor', 'pleasantly', 'surprised', 'room', 'roof'], # trip
                       ['just', 'returned', 'excellent', 'break', 'barcelona', 'hotel', 'modern', 'decor', 'spotlessly', 'clean']] # universal
for context_word in context_words_CBOW2:
    predicted_index_1 = predict(context_word, model_1, word_to_ix_1, device)
    predicted_word_1 = ix_to_word_1[predicted_index_1]
    print(f"CBOW2 Predicted word for {context_word}: {predicted_word_1}")
for context_word in context_words_CBOW5:
    predicted_index_2 = predict(context_word, model_2, word_to_ix_2, device)
    predicted_word_2 = ix_to_word_2[predicted_index_2]
    print(f"CBOW5 Predicted word for {context_word}: {predicted_word_2}")


CBOW2 Predicted word for ['booked', 'hotel', 'advisor', 'pleasantly']: trip
CBOW2 Predicted word for ['break', 'barcelona', 'hotel', 'modern']: hotel
CBOW5 Predicted word for ['great', 'hotel', 'money', 'booked', 'hotel', 'advisor', 'pleasantly', 'surprised', 'room', 'roof']: hotel
CBOW5 Predicted word for ['just', 'returned', 'excellent', 'break', 'barcelona', 'hotel', 'modern', 'decor', 'spotlessly', 'clean']: hotel


**Part 2 - Test your embeddings**

In [None]:
import nltk
from collections import Counter
import torch
import torch.nn as nn

nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
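def get_closest_word(word, vocab_size, word_to_ix, topn=5, model_path=None):
    word_distance = []
    index_to_word = {index: word for word, index in word_to_ix.items()}

    # NOTE: a freshly constructed CBOW has randomly initialized embeddings.
    # Pass `model_path` (e.g. 'model_1.pth') to compare trained embeddings;
    # without it, the distances below reflect the random initialization only.
    model = CBOW(vocab_size, Config.EMBEDDING_DIM).to(device)
    if model_path is not None:
        model.load_state_dict(torch.load(model_path, map_location=device))
    emb = model.embeddings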
    pdist = nn.PairwiseDistance()
    i = word_to_ix[word]
    lookup_tensor_i = torch.tensor([i], dtype=torch.long).to(device)
    v_i = emb(lookup_tensor_i)

    for j in range(vocab_size):
        if j != i:
            lookup_tensor_j = torch.tensor([j], dtype=torch.long).to(device)
            v_j = emb(lookup_tensor_j)
            word_distance.append((index_to_word[j], float(pdist(v_i, v_j))))

    word_distance.sort(key=lambda x: x[1])
    return word_distance[:topn]
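
The per-word loop above can also be written as one vectorized distance computation over the whole embedding matrix. The sketch below is an alternative under stated assumptions, not the notebook's method: `nearest_neighbors` is a hypothetical helper that takes the weight matrix explicitly and uses plain Euclidean distance, and the toy two-dimensional "embeddings" are hand-made values for checking the logic.

```python
import torch


def nearest_neighbors(weights, word, word_to_ix, topn=5):
    """Return the `topn` words closest to `word` by Euclidean distance.

    `weights` is a (|V|, D) tensor, e.g. `model.embeddings.weight.detach()`
    taken from a model whose trained state_dict has been loaded.
    """
    ix_to_word = {ix: w for w, ix in word_to_ix.items()}
    i = word_to_ix[word]
    distances = torch.norm(weights - weights[i], dim=1)  # distance to every row
    distances[i] = float("inf")                          # exclude the word itself
    values, indices = torch.topk(distances, k=topn, largest=False)
    return [(ix_to_word[j], d) for j, d in zip(indices.tolist(), values.tolist())]


# Toy check with hand-made 2-d "embeddings" (illustrative values only).
toy_word_to_ix = {"good": 0, "great": 1, "bad": 2}
toy_weights = torch.tensor([[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]])
result = nearest_neighbors(toy_weights, "good", toy_word_to_ix, topn=2)
print(result)
```

With the models from Part 1, a call might look like `nearest_neighbors(model_1.embeddings.weight.detach(), 'good', word_to_ix_1)`, where `model_1` comes from `load_model`. The `word_to_ix` mapping has to be the same one used during training, because it is built from a `set` and the index order is not stable across Python sessions.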


In [None]:
# Find top and least frequent words for each category
def find_top_least_frequent_words(tokens):
    tokens_with_pos = nltk.pos_tag(tokens)

    # Find nouns/verbs/adjectives
    adjectives = [word for word, pos in tokens_with_pos if pos in ['JJ', 'JJR', 'JJS']]
    nouns = [word for word, pos in tokens_with_pos if pos in ['NN']]
    verbs = [word for word, pos in tokens_with_pos if pos in ['VBZ', 'VBP', 'VBN', 'VBG', 'VBD', 'VB']]

    # Count occurrences
    adjective_counts = Counter(adjectives)
    noun_counts = Counter(nouns)
    verb_counts = Counter(verbs)

    # Extract top 3 and least frequent 10
    most_least_frequent = {
        "Most frequent adjectives": adjective_counts.most_common(3),
        "Least frequent adjectives": adjective_counts.most_common()[:-11:-1],
        "Most frequent nouns": noun_counts.most_common(3),
        "Least frequent nouns": noun_counts.most_common()[:-11:-1],
        "Most frequent verbs": verb_counts.most_common(3),
        "Least frequent verbs": verb_counts.most_common()[:-11:-1]
    }

    return most_least_frequent

# Analyze and print the results for both files
print("Hotel Reviews Dataset:")
print(find_top_least_frequent_words(all_tokens_hotel))

print("\nSci-Fi Story Dataset:")
print(find_top_least_frequent_words(all_tokens_scifi))

Hotel Reviews Dataset:
{'Most frequent adjectives': [('great', 10263), ('good', 8352), ('nice', 5683)], 'Least frequent adjectives': [('flying', 1), ('remedy', 1), ('hairl', 1), ('stainless', 1), ('windsor', 1), ('serivce', 1), ('unwrap', 1), ('flustered', 1), ('phantastic', 1), ('uncleared', 1)], 'Most frequent nouns': [('hotel', 22944), ('room', 16561), ('staff', 7963)], 'Least frequent nouns': [('yellow', 1), ('clump', 1), ('safar', 1), ('sme', 1), ('consolation', 1), ('dumonde', 1), ('canteen', 1), ('utilization', 1), ('netscape', 1), ('dinosaur', 1)], 'Most frequent verbs': [('stayed', 4288), ('got', 3012), ('went', 2253)], 'Least frequent verbs': [('clump', 1), ('villages', 1), ('prepaying', 1), ('susie', 1), ('mao', 1), ('rc', 1), ('segregated', 1), ('spas', 1), ('nonsmoking', 1), ('prompting', 1)]}

Sci-Fi Story Dataset:
{'Most frequent adjectives': [('little', 8066), ('good', 7341), ('new', 7289)], 'Least frequent adjectives': [('hearl', 1), ('prerevolutionary', 1), ('reconstr

In [None]:
# Define the words for each dataset
test_words_hotel = ['great', 'good', 'phantastic', 'hotel', 'room', 'utilization', 'got', 'went', 'prepaying' ]
test_words_scifi = ['little', 'new', 'prerevolutionary', 'time', 'man', 'zoos', 'said', 'know', 'acticing']

# Print closest words for the hotel reviews dataset
print("Closest words in the hotel reviews dataset:")
for word in test_words_hotel:
    closest_words = get_closest_word(word, topn=5, vocab_size=vocab_size_1, word_to_ix=word_to_ix_1)
    print(f"'{word}': {closest_words}")

# Print closest words for the Sci-fi dataset
print("\nClosest words in the Sci-fi dataset:")
for word in test_words_scifi:
    closest_words = get_closest_word(word, topn=5, vocab_size=vocab_size_3, word_to_ix=word_to_ix_3)
    print(f"'{word}': {closest_words}")


Closest words in the hotel reviews dataset:
'great': [('condition', 6.2644853591918945), ('plummeted', 6.444531440734863), ('perfic', 6.513710975646973), ('disconcerted', 6.576338291168213), ('muscle', 6.623223304748535)]
'good': [('eurorstars', 5.879917144775391), ('flames', 6.211822032928467), ('reat', 6.5279951095581055), ('barmaid', 6.540792942047119), ('callout', 6.583652496337891)]
'phantastic': [('maddening', 6.302803039550781), ('tito', 6.546751499176025), ('modernist', 6.70634651184082), ('guava', 6.718997478485107), ('warmly', 6.726440906524658)]
'hotel': [('corn', 6.549899578094482), ('smitten', 6.597479820251465), ('tropicana', 6.830557823181152), ('wisteria', 6.850427627563477), ('crepe', 6.864654541015625)]
'room': [('murmured', 7.4121527671813965), ('champgane', 7.756467819213867), ('verandah', 7.849087715148926), ('champaine', 7.875658988952637), ('inspector', 7.998842239379883)]
'utilization': [('foreground', 6.802746295928955), ('maniac', 6.8405375480651855), ('bliste

Choose 2 words and retrieve their 5 closest neighbours according to hotel review-based embeddings and the
Sci-fi-based embeddings.

In [None]:
words_to_find = ['good', 'get']
results = {}

# Find neighbors for each word using both datasets
for word in words_to_find:
    neighbors_hotel = get_closest_word(word, topn=5, vocab_size=vocab_size_1, word_to_ix=word_to_ix_1)
    neighbors_scifi = get_closest_word(word, topn=5, vocab_size=vocab_size_3, word_to_ix=word_to_ix_3)

    results[word] = {
        'hotel_review_based': neighbors_hotel,
        'sci_fi_based': neighbors_scifi
    }

# Print the results
for word, neighbors in results.items():
    print(f"Word: '{word}'")
    print("Hotel Review-Based Embeddings:")
    for neighbor, distance in neighbors['hotel_review_based']:
        print(f"{neighbor}: {distance:.4f}")
    print("\nSci-Fi-Based Embeddings:")
    for neighbor, distance in neighbors['sci_fi_based']:
        print(f"{neighbor}: {distance:.4f}")
    print("\n")


Word: 'good'
Hotel Review-Based Embeddings:
mirador: 6.1486
streetnoise: 6.2175
multitude: 6.3897
sofo: 6.4843
dinging: 6.4877

Sci-Fi-Based Embeddings:
ottawa: 6.7418
basso: 6.7445
zoos: 6.8327
spile: 6.8351
imposingly: 6.8814


Word: 'get'
Hotel Review-Based Embeddings:
variaty: 6.3450
velcro: 6.3659
kinkos: 6.5226
activated: 6.5887
sketchiest: 6.6011

Sci-Fi-Based Embeddings:
sidle: 6.8170
horriblefrenzy: 6.8345
anmore: 6.8713
mathsar: 6.9113
dements: 6.9704


