<a href="https://colab.research.google.com/github/Andrian0s/ML4NLP1-2024-Tutorial-Notebooks/blob/main/exercises/ex2/Exercise_2_Word_Embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%matplotlib inline

### Source: [link](https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html#exercise-computing-word-embeddings-continuous-bag-of-words)

# Word Embeddings: Encoding Lexical Semantics

Word embeddings are dense vectors of real numbers, one per word in your
vocabulary. In NLP, it is almost always the case that your features are
words! But how should you represent a word in a computer? You could
store its ASCII character representation, but that only tells you what
the word *is*, it doesn't say much about what it *means* (you might be
able to derive its part of speech from its affixes, or properties from
its capitalization, but not much). Even more, in what sense could you
combine these representations? We often want dense outputs from our
neural networks, where the inputs are $|V|$ dimensional, where
$V$ is our vocabulary, but often the outputs are only a few
dimensional (if we are only predicting a handful of labels, for
instance). How do we get from a massive dimensional space to a smaller
dimensional space?

How about instead of ASCII representations, we use a one-hot encoding?
That is, we represent the word $w$ by

\begin{align}\overbrace{\left[ 0, 0, \dots, 1, \dots, 0, 0 \right]}^\text{|V| elements}\end{align}

where the 1 is in a location unique to $w$. Any other word will
have a 1 in some other location, and a 0 everywhere else.
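
As a quick illustration of the encoding just described, here is a minimal sketch for a toy three-word vocabulary. The vocabulary and the helpers `one_hot` and `dot` are made up for this example. It also shows that the dot product of two different one-hot vectors is always 0.

```python
# Toy vocabulary, chosen only for illustration
vocab = ["mathematician", "physicist", "store"]
word_to_ix = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Return a |V|-element list with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[word_to_ix[word]] = 1
    return vec


def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


print(one_hot("physicist"))
print(dot(one_hot("mathematician"), one_hot("physicist")))
print(dot(one_hot("physicist"), one_hot("physicist")))
```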

There is an enormous drawback to this representation, besides just how
huge it is. It basically treats all words as independent entities with
no relation to each other. What we really want is some notion of
*similarity* between words. Why? Let's see an example.

Suppose we are building a language model. Suppose we have seen the
sentences

* The mathematician ran to the store.
* The physicist ran to the store.
* The mathematician solved the open problem.

in our training data. Now suppose we get a new sentence never before
seen in our training data:

* The physicist solved the open problem.

Our language model might do OK on this sentence, but wouldn't it be much
better if we could use the following two facts:

* We have seen mathematician and physicist in the same role in a sentence. Somehow they
  have a semantic relation.
* We have seen mathematician in the same role in this new unseen sentence
  as we are now seeing physicist.

and then infer that physicist is actually a good fit in the new unseen
sentence? This is what we mean by a notion of similarity: we mean
*semantic similarity*, not simply having similar orthographic
representations. It is a technique to combat the sparsity of linguistic
data, by connecting the dots between what we have seen and what we
haven't. This example of course relies on a fundamental linguistic
assumption: that words appearing in similar contexts are related to each
other semantically. This is called the [distributional
hypothesis](https://en.wikipedia.org/wiki/Distributional_semantics).


# Getting Dense Word Embeddings

How can we solve this problem? That is, how could we actually encode
semantic similarity in words? Maybe we think up some semantic
attributes. For example, we see that both mathematicians and physicists
can run, so maybe we give these words a high score for the "is able to
run" semantic attribute. Think of some other attributes, and imagine
what you might score some common words on those attributes.

If each attribute is a dimension, then we might give each word a vector,
like this:

\begin{align}q_\text{mathematician} = \left[ \overbrace{2.3}^\text{can run},
   \overbrace{9.4}^\text{likes coffee}, \overbrace{-5.5}^\text{majored in Physics}, \dots \right]\end{align}

\begin{align}q_\text{physicist} = \left[ \overbrace{2.5}^\text{can run},
   \overbrace{9.1}^\text{likes coffee}, \overbrace{6.4}^\text{majored in Physics}, \dots \right]\end{align}

Then we can get a measure of similarity between these words by doing:

\begin{align}\text{Similarity}(\text{physicist}, \text{mathematician}) = q_\text{physicist} \cdot q_\text{mathematician}\end{align}

Although it is more common to normalize by the lengths:

\begin{align}\text{Similarity}(\text{physicist}, \text{mathematician}) = \frac{q_\text{physicist} \cdot q_\text{mathematician}}
   {\| q_\text{physicist} \| \| q_\text{mathematician} \|} = \cos (\phi)\end{align}

Where $\phi$ is the angle between the two vectors. That way,
extremely similar words (words whose embeddings point in the same
direction) will have similarity 1. Extremely dissimilar words should
have similarity -1.
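
To make the two similarity measures concrete, the sketch below evaluates them on the first three entries of the hand-crafted vectors above (the elided dimensions are simply left out). The helper names `dot` and `cosine_similarity` are chosen for this example only.

```python
import math

# First three entries of the hand-crafted vectors above; the elided
# dimensions ("...") are left out of this sketch.
q_mathematician = [2.3, 9.4, -5.5]
q_physicist = [2.5, 9.1, 6.4]


def dot(u, v):
    """Unnormalized similarity: the plain dot product."""
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Dot product normalized by the lengths of both vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


print(dot(q_physicist, q_mathematician))
print(cosine_similarity(q_physicist, q_mathematician))

# A vector and its negation point in opposite directions
print(cosine_similarity(q_physicist, [-x for x in q_physicist]))
```

The cosine value always lies between -1 and 1, whereas the raw dot product also grows with the lengths of the vectors.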


You can think of the sparse one-hot vectors from the beginning of this
section as a special case of these new vectors we have defined, where
each word basically has similarity 0, and we gave each word some unique
semantic attribute. These new vectors are *dense*, which is to say their
entries are (typically) non-zero.

But these new vectors are a big pain: you could think of thousands of
different semantic attributes that might be relevant to determining
similarity, and how on earth would you set the values of the different
attributes? Central to the idea of deep learning is that the neural
network learns representations of the features, rather than requiring
the programmer to design them herself. So why not just let the word
embeddings be parameters in our model, and then be updated during
training? This is exactly what we will do. We will have some *latent
semantic attributes* that the network can, in principle, learn. Note
that the word embeddings will probably not be interpretable. That is,
although with our hand-crafted vectors above we can see that
mathematicians and physicists are similar in that they both like coffee,
if we allow a neural network to learn the embeddings and see that both
mathematicians and physicists have a large value in the second
dimension, it is not clear what that means. They are similar in some
latent semantic dimension, but this probably has no interpretation to
us.


In summary, **word embeddings are a representation of the *semantics* of
a word, efficiently encoding semantic information that might be relevant
to the task at hand**. You can embed other things too: part of speech
tags, parse trees, anything! The idea of feature embeddings is central
to the field.


# Word Embeddings in PyTorch

Before we get to a worked example and an exercise, a few quick notes
about how to use embeddings in PyTorch and in deep learning programming
in general. Similar to how we defined a unique index for each word when
making one-hot vectors, we also need to define an index for each word
when using embeddings. These will be keys into a lookup table. That is,
embeddings are stored as a $|V| \times D$ matrix, where $D$
is the dimensionality of the embeddings, such that the word assigned
index $i$ has its embedding stored in the $i$'th row of the
matrix. In all of my code, the mapping from words to indices is a
dictionary named word\_to\_ix.

The module that allows you to use embeddings is torch.nn.Embedding,
which takes two arguments: the vocabulary size, and the dimensionality
of the embeddings.

To index into this table, you must use torch.LongTensor (since the
indices are integers, not floats).




In [1]:
# Author: Robert Guthrie

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
torch.manual_seed(1)

<torch._C.Generator at 0x7f0da03cabd0>

In [2]:
word_to_ix = {"hello": 0, "world": 1}
embeds = nn.Embedding(2, 5)  # 2 words in vocab, 5 dimensional embeddings
lookup_tensor = torch.tensor([word_to_ix["hello"]], dtype=torch.long)
hello_embed = embeds(lookup_tensor)
print(hello_embed)

tensor([[ 0.6614,  0.2669,  0.0617,  0.6213, -0.4519]],
       grad_fn=<EmbeddingBackward0>)
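
The lookup-table view can be checked directly. The following sketch, which reuses the two-word toy vocabulary, looks up several indices at once and confirms that each result is a row of the `weight` matrix of the `nn.Embedding` module.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)

word_to_ix = {"hello": 0, "world": 1}
embeds = nn.Embedding(len(word_to_ix), 5)  # weight has shape (|V|, D)
print(embeds.weight.shape)

# Looking up several indices at once returns one row per index
idxs = torch.tensor([word_to_ix["world"], word_to_ix["hello"]], dtype=torch.long)
batch = embeds(idxs)
print(batch.shape)

# Each returned vector is exactly the corresponding row of the weight matrix
print(torch.equal(batch[0], embeds.weight[word_to_ix["world"]]))
```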


# An Example: N-Gram Language Modeling

Recall that in an n-gram language model, given a sequence of words
$w$, we want to compute

\begin{align}P(w_i | w_{i-1}, w_{i-2}, \dots, w_{i-n+1} )\end{align}

Where $w_i$ is the ith word of the sequence.

In this example, we will compute the loss function on some training
examples and update the parameters with backpropagation.

In [3]:
CONTEXT_SIZE = 2
EMBEDDING_DIM = 10
# We will use Shakespeare Sonnet 2
test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()
# we should tokenize the input, but we will ignore that for now
# build a list of tuples.  Each tuple is ([ word_i-2, word_i-1 ], target word)
trigrams = [([test_sentence[i], test_sentence[i + 1]], test_sentence[i + 2])
            for i in range(len(test_sentence) - 2)]
# print the first 3, just so you can see what they look like
print(trigrams[:3])

vocab = set(test_sentence)
word_to_ix = {word: i for i, word in enumerate(vocab)}


class NGramLanguageModeler(nn.Module):

    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs


losses = []
loss_function = nn.NLLLoss() # Negative Log Likelihood Loss
model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
optimizer = optim.SGD(model.parameters(), lr=0.001)

for epoch in range(10):
    total_loss = 0
    for context, target in trigrams:

        # Step 1. Prepare the inputs to be passed to the model (i.e, turn the words
        # into integer indices and wrap them in tensors)
        context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)

        # Step 2. Recall that torch *accumulates* gradients. Before passing in a
        # new instance, you need to zero out the gradients from the old
        # instance
        model.zero_grad()

        # Step 3. Run the forward pass, getting log probabilities over next
        # words
        log_probs = model(context_idxs)

        # Step 4. Compute your loss function. (Again, Torch wants the target
        # word wrapped in a tensor)
        loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))

        # Step 5. Do the backward pass and update the parameters
        loss.backward()
        optimizer.step()

        # Get the Python number from a 1-element Tensor by calling tensor.item()
        total_loss += loss.item()
    print("Loss in Epoch {ep}: {l}".format(ep=epoch, l=np.round(total_loss, 2))) # The loss decreased every iteration over the training data!
    losses.append(total_loss)

[(['When', 'forty'], 'winters'), (['forty', 'winters'], 'shall'), (['winters', 'shall'], 'besiege')]
Loss in Epoch 0: 527.55
Loss in Epoch 1: 524.94
Loss in Epoch 2: 522.34
Loss in Epoch 3: 519.76
Loss in Epoch 4: 517.2
Loss in Epoch 5: 514.66
Loss in Epoch 6: 512.13
Loss in Epoch 7: 509.62
Loss in Epoch 8: 507.12
Loss in Epoch 9: 504.64


# Exercise: Computing Word Embeddings: Continuous Bag-of-Words

The Continuous Bag-of-Words model (CBOW) is frequently used in NLP deep
learning. It is a model that tries to predict words given the context of
a few words before and a few words after the target word. This is
distinct from language modeling, since CBOW is not sequential and does
not have to be probabilistic. Typically, CBOW is used to quickly train
word embeddings, and these embeddings are used to initialize the
embeddings of some more complicated model. Usually, this is referred to
as *pretraining embeddings*. It almost always helps performance a couple
of percent.

The CBOW model is as follows. Given a target word $w_i$ and an
$N$ context window on each side, $w_{i-1}, \dots, w_{i-N}$
and $w_{i+1}, \dots, w_{i+N}$, referring to all context words
collectively as $C$, CBOW tries to minimize

\begin{align}-\log p(w_i | C) = -\log \text{Softmax}(A(\sum_{w \in C} q_w) + b)\end{align}

where $q_w$ is the embedding for word $w$.
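
The objective can be evaluated term by term on toy tensors. This is only a sketch of the formula: the sizes, the random values of `q`, `A` and `b`, and the chosen context and target indices are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size, embedding_dim = 6, 4  # toy sizes, assumptions for this sketch

q = torch.randn(vocab_size, embedding_dim)  # embeddings q_w, one row per word
A = torch.randn(vocab_size, embedding_dim)  # projection matrix A
b = torch.zeros(vocab_size)                 # bias b

context = torch.tensor([0, 1, 3, 4])  # indices of the words in C
target = 2                            # index of the target word w_i

summed = q[context].sum(dim=0)            # sum of q_w over all w in C
logits = A @ summed + b                   # A (sum of q_w) + b
log_probs = F.log_softmax(logits, dim=0)  # log Softmax(...)
loss = -log_probs[target]                 # -log p(w_i | C)
print(loss.item())
```

The `CBOW` class further down in this notebook averages the context embeddings with `.mean(dim=1)` instead of summing them; for a fixed window size this only rescales the input to the linear layer by a constant factor.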


## Exercise Layout
### 1. <u>Training CBOW Embeddings</u>
1.1) Implement a CBOW Model by completing ```class CBOW(nn.Module)``` and train it on ```raw_text```.    

1.2) Load Datasets ```tripadvisor_hotel_reviews_reduced.csv``` and ```scifi_reduced.txt```.     

1.3) Decide preprocessing steps by completing the function ```def custom_preprocess()```. Describe your decisions. Note that it's your choice to create different preprocessing functions for hotel reviews and scifi datasets or use the same preprocessing function.             

1.4) Train CBOW2 with a context width of 2 (in both directions) for the Hotel Reviews dataset.   

1.5) Train CBOW5 with a context width of 5 (in both directions) for the Hotel Reviews dataset. Are predictions made by the model sensitive towards the context size?
     
1.6) Train CBOW2 with a context width of 2 (in both directions) for the Sci-Fi story dataset.  


### 2. <u>Test your Embeddings</u>
Note - Do the following for CBOW2, and optionally for CBOW5

2.1) For the hotel reviews dataset, choose 3 nouns, 3 verbs, and 3 adjectives. Make sure that some nouns/verbs/adjectives occur frequently in the corpus and that others are rare. For each of the 9 chosen words, retrieve the 5 closest words according to your trained CBOW2 model. List them in your report and comment on the performance of your model: do the neighbours the model provides make sense? Discuss.   

2.2) Do the same for Sci-Fi dataset.   

2.3) How does the quality of the hotel review-based embeddings compare with the Sci-fi-based embeddings? Elaborate.   

2.4) Choose 2 words and retrieve their 5 closest neighbours according to hotel review-based embeddings and the Sci-fi-based embeddings. Do they have different neighbours? If yes, can you reason why?    

2.5) What are the differences between CBOW2 and CBOW5 ? Can you "describe" them?   
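
One possible way to retrieve the closest words for the tasks in section 2 is cosine similarity over the rows of the embedding matrix. The sketch below is an assumption about how this could be done, not a prescribed solution: `nearest_neighbours` is a hypothetical helper, and the toy 2-dimensional embeddings are invented for illustration.

```python
import torch
import torch.nn.functional as F


def nearest_neighbours(word, embeddings, word_to_ix, k=5):
    """Return the k words with the highest cosine similarity to `word`."""
    embeddings = embeddings.detach()             # no gradients needed here
    ix_to_word = {i: w for w, i in word_to_ix.items()}
    normed = F.normalize(embeddings, dim=1)      # unit-length rows
    sims = normed @ normed[word_to_ix[word]]     # cosine similarity to every word
    sims[word_to_ix[word]] = float("-inf")       # exclude the query word itself
    top = torch.topk(sims, k=min(k, len(word_to_ix) - 1))
    return [(ix_to_word[i], s)
            for i, s in zip(top.indices.tolist(), top.values.tolist())]


# Toy 2-d embeddings, invented for illustration only
toy_word_to_ix = {"room": 0, "suite": 1, "breakfast": 2, "dinner": 3}
toy_embeddings = torch.tensor([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]])
print(nearest_neighbours("room", toy_embeddings, toy_word_to_ix, k=2))
```

With a trained model, the matrix to pass in would be `model.embeddings.weight`, together with the `word_to_ix` mapping that was used during training.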


### Tips

1. Switch from CPU to a GPU instance after you have confirmed that your training procedure is working correctly.
2. You can always save your intermediate results (embeddings, preprocessed dataset, model, etc.) to your Google Drive via Colab.



### 1.1 Create a CBOW Model by completing ```class CBOW(nn.Module)``` and test it on ```raw_text```
Implement CBOW in PyTorch by filling in the class below. Some
tips:

* Think about which parameters you need to define.
* Make sure you know what shape each operation expects. Use .view() if you need to
  reshape.

# Disclaimer: only run this code if you have enough RAM (at least 32 GB). Otherwise the preprocessing of the sci-fi dataset won't work and your program will crash!

In [3]:
import torch
import torch.nn as nn

def get_bag_of_words(data_input, context_size):

    words = [word for sentence in data_input for word in sentence]
    vocab = sorted(set(words))
    vocab_size = len(vocab)
    word_to_ix = {word: i for i, word in enumerate(vocab)}
    
    data = []
    for sentence in data_input:
        for i in range(context_size, len(sentence) - context_size):
            context = [sentence[i + j] for j in range(-context_size, context_size + 1) if j != 0]
            target = sentence[i]
            data.append((context, target))
    print("get_bag_of_words", data[:5])
    return data, word_to_ix, vocab_size

def make_context_vector(context, word_to_ix):
    idxs = [word_to_ix[w] for w in context]
    return torch.tensor(idxs, dtype=torch.long)


# Pick best device (CUDA > MPS > CPU)
device = (
    torch.device("cuda") if torch.cuda.is_available()
    else torch.device("mps") if torch.backends.mps.is_available()
    else torch.device("cpu")
)
print("Using device:", device)

EMBEDDING_DIM = 100

class CBOW(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, EMBEDDING_DIM)
        self.linear = nn.Linear(EMBEDDING_DIM, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs)
        context_embed = embeds.mean(dim=1)
        out = self.linear(context_embed)
        return out






Using device: cuda
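
To see what the windowing in `get_bag_of_words` produces, the sketch below applies the same rule to a single toy sentence. `context_target_pairs` is a hypothetical helper written for this illustration, and the sentences are made up.

```python
def context_target_pairs(sentence, context_size):
    """Apply the windowing rule of get_bag_of_words to one token list."""
    pairs = []
    for i in range(context_size, len(sentence) - context_size):
        context = [sentence[i + j]
                   for j in range(-context_size, context_size + 1) if j != 0]
        pairs.append((context, sentence[i]))
    return pairs


toy_sentence = ["the", "hotel", "room", "was", "very", "clean"]
for context, target in context_target_pairs(toy_sentence, context_size=2):
    print(context, "->", target)

# A sentence shorter than 2 * context_size + 1 tokens produces no pairs
print(context_target_pairs(["great", "hotel", "stay"], context_size=2))
```

Only positions with a full window on both sides become targets, so the first and last `context_size` tokens of every sentence are used as context only.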


### 1.2 Load Datasets

In [5]:
### Load Datasets tripadvisor_hotel_reviews_reduced.csv and scifi_reduced.txt

!gdown 1foE1JuZJeu5E_4qVge9kExzhvF32teuF # For Hotel Reviews
!gdown 13IWXrTjGTrfCd9l7dScZVO8ZvMicPU75 # For Scifi-Text

Downloading...
From: https://drive.google.com/uc?id=1foE1JuZJeu5E_4qVge9kExzhvF32teuF
To: /home/work/Documents/GitHub/ML4NLP1/exercises/ex2/tripadvisor_hotel_reviews_reduced.csv
100%|██████████████████████████████████████| 7.36M/7.36M [00:00<00:00, 10.6MB/s]
Downloading...
From: https://drive.google.com/uc?id=13IWXrTjGTrfCd9l7dScZVO8ZvMicPU75
To: /home/work/Documents/GitHub/ML4NLP1/exercises/ex2/scifi_reduced.txt
100%|██████████████████████████████████████| 43.1M/43.1M [00:03<00:00, 13.9MB/s]


### 1.3 Preprocess Datasets
### 🗒❓ Describe your decisions for preprocessing the datasets
See the lab report.

In [None]:
### Complete the preprocessing function and apply it to the datasets
import spacy

nlp = spacy.load("en_core_web_sm", disable=["ner"]) 
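# Negations to keep if stop words are filtered.
# Note: custom_stops is built here but is not applied in custom_preprocess below,
# so stop words are currently left in the token lists.
stop_keep = {"not", "no", "n't"}
custom_stops = (nlp.Defaults.stop_words - stop_keep)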

def custom_preprocess(dataset):

    processed = []
    for doc in nlp.pipe(dataset, batch_size=128, n_process=1):
        tokens = [
            t.lemma_.lower()
            for t in doc
            if not t.is_punct
            and not t.is_space
            and t.lemma_ != ""
            and (t.is_alpha or t.like_num)
        ]
        if tokens:
            processed.append(tokens)  
    print("custom_preprocess", processed[:5])
    return processed
    



In [7]:
#Preprocessing Orchestrator hotel_reviews context_size 2
import pandas as pd
import torch
csv_path = "./tripadvisor_hotel_reviews_reduced.csv"

df = pd.read_csv(csv_path)
#df = df.head(500) #only for testing purposes
data_input = df['Review'].astype(str).tolist()

context_size = 2
preprocessed_data = custom_preprocess(data_input)
data, word_to_ix, vocab_size = get_bag_of_words(preprocessed_data, context_size)
completely_processed_x = []
completely_processed_y = []
for bag_of_context_word, target_word in data:
    completely_processed_x.append(make_context_vector(bag_of_context_word, word_to_ix))
    completely_processed_y.append(word_to_ix[target_word])

input_x = torch.stack(completely_processed_x)
input_y = torch.tensor(completely_processed_y, dtype=torch.long)
print(completely_processed_x[:5])
print(completely_processed_y[:5])

print("Example pair:")
print("Context tensor:", input_x[0])
print("Target index:", input_y[0])


custom_preprocess [['fantastic', 'service', 'large', 'hotel', 'cater', 'business', 'corporate', 'serve', 'provide', 'well', 'wife', 'nothing', 'short', 'room', 'upgrade', 'superior', 'room', 'overlook', 'harbour', 'marina', 'large', 'window', '50', 'foot', 'length', 'anniversary', 'bottle', 'champagne', 'send', 'chocolate', 'compliment', 'management', 'expensive', 'do', 'not', 'regret', 'moment', 'choice', 'hotel', 'highly', 'recommend', 'exclusive', 'hotel', 'break', 'pamper'], ['great', 'hotel', 'modern', 'hotel', 'good', 'location', 'locate', 'just', '2', 'minute', 'metro', 'sation', 'stop', 'airport', 'clean', 'equiped', 'room', 'good', 'soundproofing', 'ask', 'overlook', 'central', 'courtyard', 'hotel', 'main', 'road', 'bottled', 'water', 'available', 'free', 'room', 'mini', 'bar', 'breakfast', 'superb', 'want', '10', 'euro', 'cold', 'buffet', '14', 'euros', 'hot', 'food'], ['3', 'star', 'plus', 'glasgowjust', 'get', '30th', 'november', '4', 'day', 'visit', 'great', 'good', 'value

### 1.4 Train CBOW2 with a context width of 2 (in both directions) for the Hotel Reviews dataset.

In [None]:
import os
import torch
import torch.nn as nn

input_x = input_x.long()
input_y = input_y.long()

model = CBOW(vocab_size).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

model.train()
batch_size = 64
num_epochs = 250              # upper limit
patience = 5                  # stop after 5 epochs without improvement
min_delta = 1e-4              # required improvement to reset patience

N = input_x.size(0)
best_loss = float("inf")
epochs_no_improve = 0

for epoch in range(1, num_epochs + 1):
    total_loss = 0.0

    for start in range(0, N, batch_size):
        end = start + batch_size
        x_batch = input_x[start:end].to(device, non_blocking=True)
        y_batch = input_y[start:end].to(device, non_blocking=True)

        optimizer.zero_grad(set_to_none=True)
        logits = model(x_batch)
        loss = criterion(logits, y_batch)
        loss.backward()
        optimizer.step()

        total_loss += loss.item() * x_batch.size(0)

    avg_loss = total_loss / N
    print(f"Epoch {epoch:3d} - loss: {avg_loss:.6f}")

    # --- early stopping check ---
    if best_loss - avg_loss > min_delta:
        best_loss = avg_loss
        epochs_no_improve = 0
    else:
        epochs_no_improve += 1
        if epochs_no_improve >= patience:
            print(f"Early stopping: no improvement for {patience} epochs.")
            break

# --- save final model: only the weights (state_dict) are saved, so the model must be rebuilt before loading ---
model_name = "CBOW2_hotel.pt"
torch.save(model.state_dict(), model_name)
print(f"Model saved as {model_name}, best loss = {best_loss:.6f}")


Epoch   1 - loss: 7.026946
Epoch   2 - loss: 6.377628
Epoch   3 - loss: 6.185637
Epoch   4 - loss: 6.063917
Epoch   5 - loss: 5.972479
Epoch   6 - loss: 5.897673
Epoch   7 - loss: 5.833078
Epoch   8 - loss: 5.774929
Epoch   9 - loss: 5.720558
Epoch  10 - loss: 5.667643
Epoch  11 - loss: 5.613806
Epoch  12 - loss: 5.556654
Epoch  13 - loss: 5.495918
Epoch  14 - loss: 5.435517
Epoch  15 - loss: 5.380709
Epoch  16 - loss: 5.333974
Epoch  17 - loss: 5.295088
Epoch  18 - loss: 5.262750
Epoch  19 - loss: 5.235479
Epoch  20 - loss: 5.212067
Epoch  21 - loss: 5.191587
Epoch  22 - loss: 5.173331
Epoch  23 - loss: 5.156775
Epoch  24 - loss: 5.141595
Epoch  25 - loss: 5.127490
Epoch  26 - loss: 5.114306
Epoch  27 - loss: 5.101849
Epoch  28 - loss: 5.090077
Epoch  29 - loss: 5.078864
Epoch  30 - loss: 5.068203
Epoch  31 - loss: 5.057977
Epoch  32 - loss: 5.048221
Epoch  33 - loss: 5.038844
Epoch  34 - loss: 5.029870
Epoch  35 - loss: 5.021229
Epoch  36 - loss: 5.012900
Epoch  37 - loss: 5.004876
E
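
The early-stopping rule used in the training loop can be isolated from the model and traced on a plain list of losses. In this sketch, `stopping_epoch` is a hypothetical helper and the loss values are made up.

```python
def stopping_epoch(losses, patience=5, min_delta=1e-4):
    """Return the 1-based epoch at which early stopping triggers, or None."""
    best_loss = float("inf")
    epochs_no_improve = 0
    for epoch, loss in enumerate(losses, start=1):
        if best_loss - loss > min_delta:
            # Improvement large enough: remember it and reset the counter
            best_loss = loss
            epochs_no_improve = 0
        else:
            epochs_no_improve += 1
            if epochs_no_improve >= patience:
                return epoch
    return None


# Made-up loss curve: improves for three epochs, then plateaus
toy_losses = [7.0, 6.4, 6.2, 6.2, 6.2, 6.2]
print(stopping_epoch(toy_losses, patience=3))
```

In the training cell the monitored quantity is the average training loss, not a validation loss, so the rule only fires once the training loss itself stops improving by more than `min_delta`.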

### 1.5 Train CBOW5 with a context width of 5 (in both directions) for the Hotel Reviews dataset.  

🗒❓ Are predictions made by the model sensitive towards the context size?
Answered in the report at the end of the notebook.

In [9]:
#Preprocessing Orchestrator hotel_reviews context_size 5
import pandas as pd
import torch
csv_path = "./tripadvisor_hotel_reviews_reduced.csv"

df = pd.read_csv(csv_path)
#df = df.head(500) #only for testing purposes
data_input = df['Review'].astype(str).tolist()

context_size = 5
preprocessed_data = custom_preprocess(data_input)
data, word_to_ix, vocab_size = get_bag_of_words(preprocessed_data, context_size)
completely_processed_x = []
completely_processed_y = []
for bag_of_context_word, target_word in data:
    completely_processed_x.append(make_context_vector(bag_of_context_word, word_to_ix))
    completely_processed_y.append(word_to_ix[target_word])

input_x = torch.stack(completely_processed_x)
input_y = torch.tensor(completely_processed_y, dtype=torch.long)
print(completely_processed_x[:5])
print(completely_processed_y[:5])

print("Example pair:")
print("Context tensor:", input_x[0])
print("Target index:", input_y[0])


custom_preprocess [['fantastic', 'service', 'large', 'hotel', 'cater', 'business', 'corporate', 'serve', 'provide', 'well', 'wife', 'nothing', 'short', 'room', 'upgrade', 'superior', 'room', 'overlook', 'harbour', 'marina', 'large', 'window', '50', 'foot', 'length', 'anniversary', 'bottle', 'champagne', 'send', 'chocolate', 'compliment', 'management', 'expensive', 'do', 'not', 'regret', 'moment', 'choice', 'hotel', 'highly', 'recommend', 'exclusive', 'hotel', 'break', 'pamper'], ['great', 'hotel', 'modern', 'hotel', 'good', 'location', 'locate', 'just', '2', 'minute', 'metro', 'sation', 'stop', 'airport', 'clean', 'equiped', 'room', 'good', 'soundproofing', 'ask', 'overlook', 'central', 'courtyard', 'hotel', 'main', 'road', 'bottled', 'water', 'available', 'free', 'room', 'mini', 'bar', 'breakfast', 'superb', 'want', '10', 'euro', 'cold', 'buffet', '14', 'euros', 'hot', 'food'], ['3', 'star', 'plus', 'glasgowjust', 'get', '30th', 'november', '4', 'day', 'visit', 'great', 'good', 'value

In [None]:
import os
import torch
import torch.nn as nn

input_x = input_x.long()
input_y = input_y.long()

model = CBOW(vocab_size).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

model.train()
batch_size = 64
num_epochs = 250              # upper limit
patience = 5                  # stop after 5 epochs without improvement
min_delta = 1e-4              # required improvement to reset patience

N = input_x.size(0)
best_loss = float("inf")
epochs_no_improve = 0

for epoch in range(1, num_epochs + 1):
    total_loss = 0.0

    for start in range(0, N, batch_size):
        end = start + batch_size
        x_batch = input_x[start:end].to(device, non_blocking=True)
        y_batch = input_y[start:end].to(device, non_blocking=True)

        optimizer.zero_grad(set_to_none=True)
        logits = model(x_batch)
        loss = criterion(logits, y_batch)
        loss.backward()
        optimizer.step()

        total_loss += loss.item() * x_batch.size(0)

    avg_loss = total_loss / N
    print(f"Epoch {epoch:3d} - loss: {avg_loss:.6f}")

    # --- early stopping check ---
    if best_loss - avg_loss > min_delta:
        best_loss = avg_loss
        epochs_no_improve = 0
    else:
        epochs_no_improve += 1
        if epochs_no_improve >= patience:
            print(f"Early stopping: no improvement for {patience} epochs.")
            break



model_name = "CBOW5_hotel.pt"  


if os.path.exists(model_name):
    os.remove(model_name)

# Save only the weights (state_dict); the model has to be rebuilt before loading them
torch.save(model.state_dict(), model_name)

print(f"Model saved as {model_name}")


Epoch   1 - loss: 7.154890
Epoch   2 - loss: 6.569460
Epoch   3 - loss: 6.392240
Epoch   4 - loss: 6.278913
Epoch   5 - loss: 6.193590
Epoch   6 - loss: 6.123602
Epoch   7 - loss: 6.062994
Epoch   8 - loss: 6.008406
Epoch   9 - loss: 5.957670
Epoch  10 - loss: 5.909108
Epoch  11 - loss: 5.861149
Epoch  12 - loss: 5.811895
Epoch  13 - loss: 5.758581
Epoch  14 - loss: 5.698367
Epoch  15 - loss: 5.632750
Epoch  16 - loss: 5.569044
Epoch  17 - loss: 5.513042
Epoch  18 - loss: 5.465967
Epoch  19 - loss: 5.426689
Epoch  20 - loss: 5.393552
Epoch  21 - loss: 5.365123
Epoch  22 - loss: 5.340245
Epoch  23 - loss: 5.318129
Epoch  24 - loss: 5.298183
Epoch  25 - loss: 5.279929
Epoch  26 - loss: 5.263052
Epoch  27 - loss: 5.247343
Epoch  28 - loss: 5.232531
Epoch  29 - loss: 5.218616
Epoch  30 - loss: 5.205295
Epoch  31 - loss: 5.192680
Epoch  32 - loss: 5.180502
Epoch  33 - loss: 5.168922
Epoch  34 - loss: 5.157722
Epoch  35 - loss: 5.147005
Epoch  36 - loss: 5.136502
Epoch  37 - loss: 5.126537
E
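
Because only the `state_dict` is written to disk, the model has to be rebuilt with the same sizes before the weights can be loaded. The sketch below shows that round trip on a small stand-in: `TinyCBOW`, its sizes, and the file name are assumptions for illustration, not the model trained above.

```python
import os
import tempfile

import torch
import torch.nn as nn


class TinyCBOW(nn.Module):
    """Stand-in with the same layer layout as the CBOW class above."""

    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Linear(embedding_dim, vocab_size)

    def forward(self, inputs):
        return self.linear(self.embeddings(inputs).mean(dim=1))


trained = TinyCBOW(vocab_size=10, embedding_dim=4)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tiny_cbow.pt")  # hypothetical file name
    torch.save(trained.state_dict(), path)

    # Rebuild with the same constructor arguments, then load the weights
    restored = TinyCBOW(vocab_size=10, embedding_dim=4)
    restored.load_state_dict(torch.load(path, map_location="cpu"))

restored.eval()
embedding_matrix = restored.embeddings.weight.detach()  # shape (|V|, D)
print(embedding_matrix.shape)
```

The `word_to_ix` mapping is not part of the `state_dict`. Since each dataset and preprocessing run builds its own vocabulary, the mapping has to be kept or recomputed alongside the saved weights.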

### 1.6 Train CBOW2 with a context width of 2 (in both directions) for the Sci-Fi story dataset

In [None]:
# Breaks up the long string into manageable subparts for further processing
def chunk_text_smart(
    text: str,
    max_chars: int = 50_000,
    boundary_window: int = 1_000,
    overlap_tokens: int = 4,            
    approx_chars_per_token: int = 6
):
    n = len(text)
    i = 0
    chunks = []
    overlap_chars = max(0, overlap_tokens * approx_chars_per_token)

    while i < n:
        end = min(i + max_chars, n)
        win_start = max(i, end - boundary_window)

        # Search backward for a nice boundary within [win_start, end)
        window = text[win_start:end]
        candidates = []

        # Boundary patterns to try; the latest match in the window is used as the cut
        back_pats = ['\n\n', '\n', '. ', '! ', '? ', ' ']
        for pat in back_pats:
            pos = window.rfind(pat)
            if pos != -1:
                # convert to absolute index; include the boundary chars
                candidates.append(win_start + pos + len(pat))

        if candidates:
            cut = max(candidates)
        else:
            # No boundary behind: try looking a bit ahead for any boundary
            look_ahead_end = min(end + boundary_window, n)
            ahead = text[end:look_ahead_end]

            fwd_cut = None
            fwd_pats = ['\n\n', '\n', '. ', '! ', '? ', ' ']
            for pat in fwd_pats:
                pos = ahead.find(pat)
                if pos != -1:
                    fwd_cut = end + pos + len(pat)
                    break

            cut = fwd_cut if fwd_cut is not None else end

        chunk = text[i:cut].strip()
        if chunk:
            chunks.append(chunk)

        if cut >= n:
            break

        next_i = cut - overlap_chars if overlap_chars > 0 else cut
        if next_i <= i:
            next_i = cut
        i = next_i

    return chunks
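
The core idea of `chunk_text_smart`, cutting at a boundary instead of in the middle of a word, can be seen in a much reduced form. `split_at_last_space` is a hypothetical helper that only considers spaces and leaves out the overlap and look-ahead logic of the function above.

```python
def split_at_last_space(text, max_chars):
    """Split text into chunks of at most max_chars, cutting at the last space."""
    chunks = []
    i = 0
    while i < len(text):
        end = min(i + max_chars, len(text))
        if end < len(text):
            # Look for the last space inside the current window
            cut = text.rfind(" ", i, end)
            # Fall back to a hard cut if the window contains no usable space
            cut = end if cut <= i else cut + 1
        else:
            cut = end
        chunk = text[i:cut].strip()
        if chunk:
            chunks.append(chunk)
        i = cut
    return chunks


print(split_at_last_space("one two three four five", max_chars=10))
```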


In [None]:
#Preprocessing Orchestrator scifi context_size 2
import pandas as pd
import torch
txt_path = "./scifi_reduced.txt"

with open(txt_path, "r", encoding="utf-8") as f:
    big_text = f.read()
    
print(f"Loaded {len(big_text)} characters from {txt_path}")
data_input = chunk_text_smart(
    big_text,
    max_chars=50_000,
    boundary_window=1000,
    overlap_tokens=2 * 2,           
    approx_chars_per_token=6
)


context_size = 2
preprocessed_data = custom_preprocess(data_input)
data, word_to_ix, vocab_size = get_bag_of_words(preprocessed_data, context_size)
completely_processed_x = []
completely_processed_y = []
for bag_of_context_word, target_word in data:
    completely_processed_x.append(make_context_vector(bag_of_context_word, word_to_ix))
    completely_processed_y.append(word_to_ix[target_word])

input_x = torch.stack(completely_processed_x)
input_y = torch.tensor(completely_processed_y, dtype=torch.long)
print(completely_processed_x[:5])
print(completely_processed_y[:5])

print("Example pair:")
print("Context tensor:", input_x[0])
print("Target index:", input_y[0])


Loaded 43062636 characters from ./scifi_reduced.txt
get_bag_of_words [(['a', 'chat', 'the', 'editor'], 'with'), (['chat', 'with', 'editor', 'i'], 'the'), (['with', 'the', 'i', 'science'], 'editor'), (['the', 'editor', 'science', 'fiction'], 'i'), (['editor', 'i', 'fiction', 'magazine'], 'science')]
[tensor([    4, 12886, 83144, 23705]), tensor([12886, 94224, 23705, 39007]), tensor([94224, 83144, 39007, 72074]), tensor([83144, 23705, 72074, 28322]), tensor([23705, 39007, 28322, 49537])]
[94224, 83144, 23705, 39007, 72074]
Example pair:
Context tensor: tensor([    4, 12886, 83144, 23705])
Target index: tensor(94224)


In [None]:
import os
import torch
import torch.nn as nn

input_x = input_x.long()
input_y = input_y.long()

model = CBOW(vocab_size).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

model.train()
batch_size = 64
num_epochs = 25              # upper limit
patience = 3                  # stop after 3 epochs without improvement
min_delta = 1e-4              # required improvement to reset patience

N = input_x.size(0)
best_loss = float("inf")
epochs_no_improve = 0

for epoch in range(1, num_epochs + 1):
    total_loss = 0.0

    for start in range(0, N, batch_size):
        end = start + batch_size
        x_batch = input_x[start:end].to(device, non_blocking=True)
        y_batch = input_y[start:end].to(device, non_blocking=True)

        optimizer.zero_grad(set_to_none=True)
        logits = model(x_batch)
        loss = criterion(logits, y_batch)
        loss.backward()
        optimizer.step()

        total_loss += loss.item() * x_batch.size(0)

    avg_loss = total_loss / N
    print(f"Epoch {epoch:3d} - loss: {avg_loss:.6f}")

    # --- early stopping check ---
    if best_loss - avg_loss > min_delta:
        best_loss = avg_loss
        epochs_no_improve = 0
    else:
        epochs_no_improve += 1
        if epochs_no_improve >= patience:
            print(f"Early stopping: no improvement for {patience} epochs.")
            break



model_name = "CBOW2_scify.pt"  

if os.path.exists(model_name):
    os.remove(model_name)

# Save only the weights (state_dict); the model class and vocabulary must be rebuilt when loading.
torch.save(model.state_dict(), model_name)

print(f"Model saved as {model_name}")


Epoch   1 - loss: 6.395333
Epoch   2 - loss: 6.114575
Epoch   3 - loss: 5.986089
Epoch   4 - loss: 5.916470
Epoch   5 - loss: 5.867760
Epoch   6 - loss: 5.827909
Epoch   7 - loss: 5.793874
Epoch   8 - loss: 5.763170
Epoch   9 - loss: 5.735227
Epoch  10 - loss: 5.709516
Epoch  11 - loss: 5.685879
Epoch  12 - loss: 5.663949
Epoch  13 - loss: 5.643700
Epoch  14 - loss: 5.625056
Epoch  15 - loss: 5.607838
Epoch  16 - loss: 5.592004
Epoch  17 - loss: 5.577324
Epoch  18 - loss: 5.563733
Epoch  19 - loss: 5.551004
Epoch  20 - loss: 5.539299
Epoch  21 - loss: 5.528451
Epoch  22 - loss: 5.518379
Epoch  23 - loss: 5.509089
Epoch  24 - loss: 5.500560
Epoch  25 - loss: 5.492636
Model saved as CBOW2_scify.pt
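
To put the loss values in perspective, a cross-entropy loss (in nats, as returned by `nn.CrossEntropyLoss`) can be converted into a perplexity via `exp(loss)`. The sketch below does this for the first and last epoch printed above. The uniform baseline uses a vocabulary size of 94,225, which is only a lower bound inferred from the largest index (94224) in the printed tensors, not a value reported by the notebook.

```python
import math

def perplexity(cross_entropy_nats):
    """Perplexity is the exponential of the average cross-entropy (natural log)."""
    return math.exp(cross_entropy_nats)

first_epoch_loss = 6.395333   # epoch 1, from the log above
last_epoch_loss = 5.492636    # epoch 25, from the log above

# ASSUMPTION: lower bound on |V| inferred from the largest printed index (94224).
assumed_vocab_size = 94_225
uniform_loss = math.log(assumed_vocab_size)  # loss of a model guessing uniformly

print(f"uniform baseline loss: {uniform_loss:.2f}")
print(f"perplexity after epoch 1:  {perplexity(first_epoch_loss):.0f}")
print(f"perplexity after epoch 25: {perplexity(last_epoch_loss):.0f}")
```

The loss was still decreasing by more than `min_delta` at epoch 25, so the run ended at the epoch limit rather than through early stopping.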


### 2.1 For the hotel reviews dataset, choose 3 nouns, 3 verbs, and 3 adjectives. (CBOW2 and optionally for CBOW5)
Make sure that some nouns/verbs/adjectives occur frequently in the corpus and that others are rare. For each of the 9 chosen words, retrieve the 5 closest words according to your trained CBOW2 model.    

🗒❓ List them in your report (at the end of this notebook) and comment on the performance of your model: do the neighbours the model provides make sense? Discuss.   
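
"Closest" is taken here to mean highest cosine similarity between embedding vectors, which is also what `nearest_neighbors_for_words` computes. A minimal pure-Python sketch on a made-up 3-dimensional embedding table (the vectors and the helper names `cosine` and `nearest` are invented for illustration and are not taken from the trained model):

```python
import math

# Toy embedding table: hand-made vectors, NOT from the trained CBOW model.
toy_embeddings = {
    "good":  [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "room":  [0.0, 0.9, 0.3],
    "suite": [0.1, 0.8, 0.4],
    "walk":  [0.2, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the L2 norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(word, embeddings, topn=2):
    """Rank all other words by cosine similarity to `word`, highest first."""
    query = embeddings[word]
    scored = [(other, cosine(query, vec)) for other, vec in embeddings.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

for w in ("good", "room"):
    print(w, [(n, round(s, 4)) for n, s in nearest(w, toy_embeddings)])
```

The notebook version does the same thing in bulk: it L2-normalizes all rows once, so a single matrix product yields every cosine similarity for a query word.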


In [None]:
import re
from collections import Counter
from typing import List, Tuple, Dict
import torch
import torch.nn.functional as F
# function to get the 100 most and least frequent words, so that one does not have to guess.
def print_freq_extremes(path: str) -> Tuple[List[Tuple[str, int]], List[Tuple[str, int]]]:

    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        text = f.read().lower()

    tokens = re.findall(r"\b[a-zA-Z]+\b", text)
    freq = Counter(tokens)

    most_100 = freq.most_common(100)
    least_100 = sorted(freq.items(), key=lambda x: x[1])[:100]

    print("=== 100 MOST FREQUENT WORDS ===")
    for w, c in most_100:
        print(f"{w}\t{c}")

    print("\n=== 100 LEAST FREQUENT WORDS ===")
    for w, c in least_100:
        print(f"{w}\t{c}")

    return most_100, least_100

In [7]:

from typing import List, Tuple, Dict
import torch.nn.functional as F

def nearest_neighbors_for_words(
    words: List[str],
    model_path: str,
    topn: int = 5
) -> Dict[str, List[Tuple[str, float]]]:

    # Load model checkpoint
    checkpoint = torch.load(model_path, map_location="cpu")
    
    # Try to extract embeddings
    if "state_dict" in checkpoint:
        state_dict = checkpoint["state_dict"]
    else:
        state_dict = checkpoint

    # Find the embedding matrix (by name or type)
    emb = None
    for k, v in state_dict.items():
        if "embedding" in k or "embeddings" in k:
            emb = v
            break

    if emb is None:
        raise ValueError("Could not find embedding layer in model checkpoint.")

    embeddings = emb.detach()
    vocab = list(range(embeddings.shape[0]))

    # If vocabulary mapping is stored
    if "vocab" in checkpoint:
        vocab = checkpoint["vocab"]
    elif "word_to_ix" in checkpoint:
        vocab = checkpoint["word_to_ix"]
    else:
        raise ValueError("No vocabulary mapping ('vocab' or 'word_to_ix') found in checkpoint.")

    # Reverse mapping
    if isinstance(vocab, dict):
        ix_to_word = {i: w for w, i in vocab.items()}
        word_to_ix = vocab
    else:
        raise ValueError("Expected vocab to be a dict mapping word→index.")

    # Normalize embeddings for cosine similarity
    norm_emb = F.normalize(embeddings, p=2, dim=1)

    results = {}
    for word in words:
        if word not in word_to_ix:
            results[word] = []
            continue

        idx = word_to_ix[word]
        vec = norm_emb[idx].unsqueeze(0)
        sims = torch.mm(vec, norm_emb.T).squeeze(0)
        topk = torch.topk(sims, k=topn + 1)

        neighbors = []
        for score, i in zip(topk.values.tolist(), topk.indices.tolist()):
            neighbor_word = ix_to_word[i]
            if neighbor_word != word:
                neighbors.append((neighbor_word, round(float(score), 4)))
            if len(neighbors) == topn:
                break
        results[word] = neighbors

    return results

In [None]:
_ = print_freq_extremes("tripadvisor_hotel_reviews_reduced.csv")

=== 100 MOST FREQUENT WORDS ===
hotel	24494
room	17100
not	15653
great	10473
t	9539
n	9320
good	8616
staff	8170
stay	7563
did	6870
nice	6225
just	6184
rooms	6043
no	5750
location	5618
stayed	5147
service	5080
night	4959
time	4956
beach	4924
day	4912
breakfast	4772
clean	4705
food	4676
like	4061
place	3898
really	3897
resort	3890
pool	3714
the	3703
friendly	3429
people	3415
small	3259
walk	3102
little	3098
got	3068
excellent	3046
area	2978
best	2880
helpful	2779
bar	2714
restaurant	2572
water	2500
restaurants	2492
trip	2491
bathroom	2483
bed	2404
view	2403
recommend	2398
beautiful	2356
floor	2337
went	2266
comfortable	2253
desk	2222
nights	2180
way	2146
right	2146
check	2107
want	2097
city	2054
better	2052
hotels	2048
make	2027
away	2017
wonderful	2011
free	2004
bit	1936
booked	1874
street	1871
price	1856
large	1842
reviews	1842
minutes	1833
buffet	1767
say	1761
new	1741
days	1724
quite	1712
lobby	1686
experience	1635
loved	1605
morning	1592
going	1569
close	1546
shower	1539
airport	153

Because only the weights were saved after training, and not the vocabulary mapping, the vocabulary has to be rebuilt for each of the models. The code is replicated for the three different models. Some definitions from earlier cells are repeated here so that those cells do not have to be re-run, which would trigger cascading effects.
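
Rebuilding only works if the word-to-index mapping comes out identical to the one used during training; otherwise row *i* of the embedding matrix would be paired with the wrong word. Since `get_bag_of_words` builds the mapping from `sorted(set(words))`, the result depends only on the set of tokens, not on their order. A small sketch of that property (`build_word_to_ix` is a hypothetical stand-alone copy of the relevant lines, and the toy sentences are made up):

```python
def build_word_to_ix(sentences):
    """Same construction as in get_bag_of_words: sorted unique tokens -> indices."""
    words = [word for sentence in sentences for word in sentence]
    vocab = sorted(set(words))
    return {word: i for i, word in enumerate(vocab)}

run_a = [["room", "be", "clean"], ["staff", "be", "friendly"]]
run_b = [["staff", "be", "friendly"], ["room", "be", "clean"]]  # same tokens, other order

mapping_a = build_word_to_ix(run_a)
mapping_b = build_word_to_ix(run_b)

print(mapping_a)
print(mapping_a == mapping_b)
```

The mapping still changes if preprocessing produces a different token set, for example with a different spaCy model version, which is why the cells below assert on the vocabulary size before writing the fixed checkpoint.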

In [None]:
#SCIFI weight-vocab alignment
import torch
import torch.nn as nn
import pandas as pd
from collections import Counter
import spacy

EMBEDDING_DIM = 100
class CBOW(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, EMBEDDING_DIM)
        self.linear = nn.Linear(EMBEDDING_DIM, vocab_size)
    def forward(self, inputs):
        embeds = self.embeddings(inputs)
        context_embed = embeds.mean(dim=1)
        return self.linear(context_embed)

nlp = spacy.load("en_core_web_sm", disable=["ner"])
stop_keep = {"not", "no", "n't"}
custom_stops = (nlp.Defaults.stop_words - stop_keep)  
def custom_preprocess(dataset):
    processed = []
    for doc in nlp.pipe(dataset, batch_size=128, n_process=1):
        tokens = [
            t.lemma_.lower()
            for t in doc
            if not t.is_punct
            and not t.is_space
            and t.lemma_ != ""
            and (t.is_alpha or t.like_num)
        ]
        if tokens:
            processed.append(tokens)
    return processed

def get_bag_of_words(data_input, context_size):
    words = [word for sentence in data_input for word in sentence]
    vocab = sorted(set(words))     
    vocab_size = len(vocab)
    word_to_ix = {word: i for i, word in enumerate(vocab)}
    data = []
    for sentence in data_input:
        for i in range(context_size, len(sentence) - context_size):
            context = [sentence[i + j] for j in range(-context_size, context_size + 1) if j != 0]
            target = sentence[i]
            data.append((context, target))
    return data, word_to_ix, vocab_size

# ---- rebuild the vocab exactly as during training ----
txt_path = "./scifi_reduced.txt"  
context_size = 2                                      
with open(txt_path, "r", encoding="utf-8") as f:
    big_text = f.read()
    
print(f"Loaded {len(big_text)} characters from {txt_path}")
data_input = chunk_text_smart(
    big_text,
    max_chars=50_000,
    boundary_window=1000,
    overlap_tokens=2 * 2,           
    approx_chars_per_token=6
)


preprocessed = custom_preprocess(data_input)
_, word_to_ix, vocab_size = get_bag_of_words(preprocessed, context_size)

# ---- load saved weights ----
state = torch.load("CBOW2_scify.pt", map_location="cpu")
state = state.get("state_dict", state)

# sanity checks
E = state["embeddings.weight"]              # [V, D]
assert E.shape[0] == len(word_to_ix), f"Vocab size mismatch: weights={E.shape[0]} vs rebuilt={len(word_to_ix)}"
assert E.shape[1] == EMBEDDING_DIM, f"Embedding dim mismatch: weights={E.shape[1]} vs code={EMBEDDING_DIM}"

W = state["linear.weight"]
assert W.shape[0] == len(word_to_ix) and W.shape[1] == EMBEDDING_DIM

# write a fixed checkpoint WITH vocab 
fixed_ckpt = {
    "state_dict": state,
    "word_to_ix": word_to_ix,
    "ix_to_word": {i: w for w, i in word_to_ix.items()},
    "meta": {
        "embedding_dim": EMBEDDING_DIM,
        "vocab_size": len(word_to_ix),
        "model_type": "CBOW2",
        "context_size": context_size,
    }
}
torch.save(fixed_ckpt, "CBOW2_scifi_fixed.pt")
print("Wrote CBOW2_scifi_fixed.pt with vocab included")


Loaded 43062636 characters from ./scifi_reduced.txt
Wrote CBOW2_scifi_fixed.pt with vocab included


In [15]:
#Hotel weight-vocab alignment (CBOW2, context_size 2)
import torch
import torch.nn as nn
import pandas as pd
from collections import Counter
import spacy

EMBEDDING_DIM = 100
class CBOW(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, EMBEDDING_DIM)
        self.linear = nn.Linear(EMBEDDING_DIM, vocab_size)
    def forward(self, inputs):
        embeds = self.embeddings(inputs)
        context_embed = embeds.mean(dim=1)
        return self.linear(context_embed)

# ---- your preprocessing (unchanged) ----
nlp = spacy.load("en_core_web_sm", disable=["ner"])
stop_keep = {"not", "no", "n't"}
custom_stops = (nlp.Defaults.stop_words - stop_keep)  
def custom_preprocess(dataset):
    processed = []
    for doc in nlp.pipe(dataset, batch_size=128, n_process=1):
        tokens = [
            t.lemma_.lower()
            for t in doc
            if not t.is_punct
            and not t.is_space
            and t.lemma_ != ""
            and (t.is_alpha or t.like_num)
        ]
        if tokens:
            processed.append(tokens)
    return processed

def get_bag_of_words(data_input, context_size):
    words = [word for sentence in data_input for word in sentence]
    vocab = sorted(set(words))     
    vocab_size = len(vocab)
    word_to_ix = {word: i for i, word in enumerate(vocab)}
    data = []
    for sentence in data_input:
        for i in range(context_size, len(sentence) - context_size):
            context = [sentence[i + j] for j in range(-context_size, context_size + 1) if j != 0]
            target = sentence[i]
            data.append((context, target))
    return data, word_to_ix, vocab_size

# ---- rebuild the vocab exactly as during training ----
csv_path = "./tripadvisor_hotel_reviews_reduced.csv"  
context_size = 2                                      
df = pd.read_csv(csv_path)
data_input = df['Review'].astype(str).tolist()
preprocessed = custom_preprocess(data_input)
_, word_to_ix, vocab_size = get_bag_of_words(preprocessed, context_size)

# ---- load saved weights ----
state = torch.load("CBOW2_hotel.pt", map_location="cpu")
state = state.get("state_dict", state)

# sanity checks
E = state["embeddings.weight"]              # [V, D]
assert E.shape[0] == len(word_to_ix), f"Vocab size mismatch: weights={E.shape[0]} vs rebuilt={len(word_to_ix)}"
assert E.shape[1] == EMBEDDING_DIM, f"Embedding dim mismatch: weights={E.shape[1]} vs code={EMBEDDING_DIM}"

W = state["linear.weight"]
assert W.shape[0] == len(word_to_ix) and W.shape[1] == EMBEDDING_DIM

# write a fixed checkpoint WITH vocab 
fixed_ckpt = {
    "state_dict": state,
    "word_to_ix": word_to_ix,
    "ix_to_word": {i: w for w, i in word_to_ix.items()},
    "meta": {
        "embedding_dim": EMBEDDING_DIM,
        "vocab_size": len(word_to_ix),
        "model_type": "CBOW2",
        "context_size": context_size,
    }
}
torch.save(fixed_ckpt, "CBOW2_hotel_fixed.pt")
print("Wrote CBOW2_hotel_fixed.pt with vocab included")


Wrote CBOW2_hotel_fixed.pt with vocab included


In [16]:
targets = ["room", "staff", "racquets", "stay", "walk", "dodging", "clean", "good", "surprinsingly"]
neighbors = nearest_neighbors_for_words(targets, "CBOW2_hotel_fixed.pt", topn=5)
for w, nbrs in neighbors.items():
    print(f"\n{w}:")
    for n, s in nbrs:
        print(f"  {n}\t{s:.4f}")


room:
  suite	0.5264
  apartment	0.4315
  balcony	0.4253
  table	0.4097
  pillow	0.4025

staff:
  employee	0.5560
  team	0.5155
  receptionist	0.4697
  manager	0.4581
  mr	0.4338

racquets:

stay:
  hotel	0.5321
  2005	0.4807
  experience	0.4643
  rue	0.4589
  staying	0.4540

walk:
  walking	0.6636
  stroll	0.6278
  ride	0.4975
  drive	0.4363
  close	0.4342

dodging:

clean:
  effeciently	0.3976
  spotless	0.3952
  appoint	0.3780
  immaculate	0.3677
  quam	0.3639

good:
  great	0.7443
  excellent	0.6587
  decent	0.6168
  perfect	0.5294
  fantastic	0.4587

surprinsingly:
  16.30	0.6507
  coronas	0.5892
  reservasion	0.5335
  piscoteo	0.5203
  sabo	0.4797


In [17]:
#Hotel weight-vocab alignment (CBOW5, context_size 5)
import torch
import torch.nn as nn
import pandas as pd
from collections import Counter
import spacy

EMBEDDING_DIM = 100
class CBOW(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, EMBEDDING_DIM)
        self.linear = nn.Linear(EMBEDDING_DIM, vocab_size)
    def forward(self, inputs):
        embeds = self.embeddings(inputs)
        context_embed = embeds.mean(dim=1)
        return self.linear(context_embed)

# ---- your preprocessing (unchanged) ----
nlp = spacy.load("en_core_web_sm", disable=["ner"])
stop_keep = {"not", "no", "n't"}
custom_stops = (nlp.Defaults.stop_words - stop_keep)  
def custom_preprocess(dataset):
    processed = []
    for doc in nlp.pipe(dataset, batch_size=128, n_process=1):
        tokens = [
            t.lemma_.lower()
            for t in doc
            if not t.is_punct
            and not t.is_space
            and t.lemma_ != ""
            and (t.is_alpha or t.like_num)
        ]
        if tokens:
            processed.append(tokens)
    return processed

def get_bag_of_words(data_input, context_size):
    words = [word for sentence in data_input for word in sentence]
    vocab = sorted(set(words))     
    vocab_size = len(vocab)
    word_to_ix = {word: i for i, word in enumerate(vocab)}
    data = []
    for sentence in data_input:
        for i in range(context_size, len(sentence) - context_size):
            context = [sentence[i + j] for j in range(-context_size, context_size + 1) if j != 0]
            target = sentence[i]
            data.append((context, target))
    return data, word_to_ix, vocab_size

# ---- rebuild the vocab exactly as during training ----
csv_path = "./tripadvisor_hotel_reviews_reduced.csv"  
context_size = 5                                      
df = pd.read_csv(csv_path)
data_input = df['Review'].astype(str).tolist()
preprocessed = custom_preprocess(data_input)
_, word_to_ix, vocab_size = get_bag_of_words(preprocessed, context_size)

# ---- load saved weights ----
state = torch.load("CBOW5_hotel.pt", map_location="cpu")
state = state.get("state_dict", state)

# sanity checks
E = state["embeddings.weight"]              # [V, D]
assert E.shape[0] == len(word_to_ix), f"Vocab size mismatch: weights={E.shape[0]} vs rebuilt={len(word_to_ix)}"
assert E.shape[1] == EMBEDDING_DIM, f"Embedding dim mismatch: weights={E.shape[1]} vs code={EMBEDDING_DIM}"

W = state["linear.weight"]
assert W.shape[0] == len(word_to_ix) and W.shape[1] == EMBEDDING_DIM

# write a fixed checkpoint WITH vocab 
fixed_ckpt = {
    "state_dict": state,
    "word_to_ix": word_to_ix,
    "ix_to_word": {i: w for w, i in word_to_ix.items()},
    "meta": {
        "embedding_dim": EMBEDDING_DIM,
        "vocab_size": len(word_to_ix),
        "model_type": "CBOW5",
        "context_size": context_size,
    }
}
torch.save(fixed_ckpt, "CBOW5_hotel_fixed.pt")
print("Wrote CBOW5_hotel_fixed.pt with vocab included")


Wrote CBOW5_hotel_fixed.pt with vocab included


In [18]:
targets = ["room", "staff", "racquets", "stay", "walk", "dodging", "clean", "good", "surprinsingly"]
neighbors = nearest_neighbors_for_words(targets, "CBOW5_hotel_fixed.pt", topn=5)
for w, nbrs in neighbors.items():
    print(f"\n{w}:")
    for n, s in nbrs:
        print(f"  {n}\t{s:.4f}")


room:
  bed	0.3633
  inconvienence	0.3411
  bathroom	0.3401
  horse	0.3399
  appeared	0.3356

staff:
  employee	0.5529
  receptionist	0.4176
  2305	0.4145
  angie	0.4125
  emp	0.4057

racquets:

stay:
  book	0.5163
  spend	0.4824
  visit	0.4752
  staying	0.4421
  return	0.4345

walk:
  walking	0.6235
  stroll	0.5813
  near	0.4965
  close	0.4896
  shopping	0.4450

dodging:

clean:
  spotless	0.4923
  seconds	0.3750
  squirt	0.3542
  lucca	0.3475
  cochroch	0.3468

good:
  great	0.6947
  decent	0.6644
  well	0.5365
  excellent	0.5222
  convenient	0.4267

surprinsingly:
  um	0.4495
  coatstand	0.4054
  unclog	0.3866
  profoundly	0.3848
  biege	0.3761


### 2.2 Repeat 2.1 for SciFi Dataset

🗒❓ List your findings for SciFi Dataset as well, similarly to 2.1

In [19]:
_ = print_freq_extremes("scifi_reduced.txt")

=== 100 MOST FREQUENT WORDS ===
the	445138
and	196128
of	188818
a	184875
to	184710
he	125543
i	113959
it	109095
in	105600
was	99888
you	89188
that	87798
his	71401
s	62986
had	56174
t	53738
on	52402
for	52068
with	48242
but	47658
as	46407
at	43956
they	43041
be	39649
we	39207
is	37945
said	36714
not	34675
him	33331
have	33254
there	31351
from	31232
were	30124
all	29656
this	29392
out	28621
one	28384
she	27732
what	27680
if	27664
her	25980
up	25718
no	24886
by	24063
an	22621
would	22544
me	22062
them	21750
been	20544
or	20407
could	20325
so	20106
then	19998
into	19931
my	19599
like	18756
are	18449
can	18377
about	17698
your	17659
now	17630
do	17581
when	16685
back	16443
their	16392
time	15975
man	15602
more	14572
down	14110
know	13929
just	13830
which	13391
will	13223
over	13141
only	13120
don	12892
ll	12493
some	12320
get	12223
who	11967
here	11924
d	11893
re	11497
any	11489
two	11417
other	11216
did	11124
its	10823
see	10688
way	10642
than	10534
how	10454
through	10406
our	10232
m	1022

In [8]:
targets = ["man", "way", "longshoreman", "was", "said", "climaxes", "right", "all", "fireless"]
neighbors = nearest_neighbors_for_words(targets, "CBOW2_scifi_fixed.pt", topn=5)
for w, nbrs in neighbors.items():
    print(f"\n{w}:")
    for n, s in nbrs:
        print(f"  {n}\t{s:.4f}")


man:
  people	0.6829
  one	0.6451
  terran	0.6443
  person	0.6443
  you	0.6307

way:
  fashion	0.6552
  path	0.5830
  day	0.5684
  suggestion	0.5650
  urge	0.5638

longshoreman:
  carnivore	0.4293
  rjhere	0.4177
  puppeteer	0.4148
  dike	0.4140
  starer	0.4063

was:
  incinerate	0.4469
  reinterpret	0.4458
  delectable	0.4250
  himsdf	0.4233
  caugjit	0.4225

said:
  luna	0.5177
  donderevo	0.5088
  insist	0.5051
  encircle	0.4825
  ha	0.4808

climaxes:

right:
  however	0.6548
  after	0.6526
  correct	0.6516
  put	0.6460
  must	0.6361

all:
  now	0.7282
  anyway	0.7154
  again	0.6901
  both	0.6844
  some	0.6749

fireless:
  scarify	0.4168
  altoplano	0.4091
  senatorial	0.3939
  generous	0.3866
  coraving	0.3841


### 2.3 🗒❓ How does the quality of the hotel review-based embeddings compare with the Sci-fi-based embeddings? Elaborate.

The embeddings based on the sci-fi data tend to be better. For the frequent words we tested, their cosine similarity scores tend to be higher. Furthermore, their nearest neighbors contain random numbers less often. For example, the most similar "word" to "surprinsingly" (a misspelling that occurs in the corpus) under the hotel-review embeddings with a context size of 2 is "16.30", with a cosine similarity of 0.65.
Another interesting observation is that the adjectives in the sci-fi embeddings tend to have higher cosine similarities, which we read as a sign of better performance. This surprised us, because we expected the reviews to contain a lot of adjectives, resulting in a model that performs well on adjectives.
That the sci-fi embeddings tend to be better overall is not surprising, however, because training took roughly 2.5 times as long. It would be interesting to compare the models after the same number of passes over the training data.
❓ List them in your report (at the end of this notebook) and comment on the performance of your model: do the neighbours the model provides make sense? Discuss.   

Please edit the cell for good layout of the examples: 
### CBOW2_Hotel:

room:
  suite	0.5264
  apartment	0.4315
  balcony	0.4253
  table	0.4097
  pillow	0.4025

staff:
  employee	0.5560
  team	0.5155
  receptionist	0.4697
  manager	0.4581
  mr	0.4338

racquets:

stay:
  hotel	0.5321
  2005	0.4807
  experience	0.4643
  rue	0.4589
  staying	0.4540

walk:
  walking	0.6636
  stroll	0.6278
  ride	0.4975
  drive	0.4363
  close	0.4342

dodging:

clean:
  effeciently	0.3976
  spotless	0.3952
  appoint	0.3780
  immaculate	0.3677
  quam	0.3639

good:
  great	0.7443
  excellent	0.6587
  decent	0.6168
  perfect	0.5294
  fantastic	0.4587

surprinsingly:
  16.30	0.6507
  coronas	0.5892
  reservasion	0.5335
  piscoteo	0.5203
  sabo	0.4797
### CBOW5_Hotel

room:
  bed	0.3633
  inconvienence	0.3411
  bathroom	0.3401
  horse	0.3399
  appeared	0.3356

staff:
  employee	0.5529
  receptionist	0.4176
  2305	0.4145
  angie	0.4125
  emp	0.4057

racquets:

stay:
  book	0.5163
  spend	0.4824
  visit	0.4752
  staying	0.4421
  return	0.4345

walk:
  walking	0.6235
  stroll	0.5813
  near	0.4965
  close	0.4896
  shopping	0.4450

dodging:

clean:
  spotless	0.4923
  seconds	0.3750
  squirt	0.3542
  lucca	0.3475
  cochroch	0.3468

good:
  great	0.6947
  decent	0.6644
  well	0.5365
  excellent	0.5222
  convenient	0.4267

surprinsingly:
  um	0.4495
  coatstand	0.4054
  unclog	0.3866
  profoundly	0.3848
  biege	0.3761

### CBOW2_sci-fi:

man:
  people	0.6829
  one	0.6451
  terran	0.6443
  person	0.6443
  you	0.6307

way:
  fashion	0.6552
  path	0.5830
  day	0.5684
  suggestion	0.5650
  urge	0.5638

longshoreman:
  carnivore	0.4293
  rjhere	0.4177
  puppeteer	0.4148
  dike	0.4140
  starer	0.4063

was:
  incinerate	0.4469
  reinterpret	0.4458
  delectable	0.4250
  himsdf	0.4233
  caugjit	0.4225

said:
  luna	0.5177
  donderevo	0.5088
  insist	0.5051
  encircle	0.4825
  ha	0.4808

climaxes:

right:
  however	0.6548
  after	0.6526
  correct	0.6516
  put	0.6460
  must	0.6361

all:
  now	0.7282
  anyway	0.7154
  again	0.6901
  both	0.6844
  some	0.6749

fireless:
  scarify	0.4168
  altoplano	0.4091
  senatorial	0.3939
  generous	0.3866
  coraving	0.3841

### 2.4 Choose 2 words and retrieve their 5 closest neighbours according to hotel review-based embeddings and the Sci-fi-based embeddings.

🗒❓ Do they have different neighbours? If yes, can you reason why?

In [13]:
print("CBOW2_hotel")
targets = ["man", "concierge"]
neighbors = nearest_neighbors_for_words(targets, "CBOW2_hotel_fixed.pt", topn=5)
for w, nbrs in neighbors.items():
    print(f"\n{w}:")
    for n, s in nbrs:
        print(f"  {n}\t{s:.4f}")


print("CBOW5_hotel")
neighbors = nearest_neighbors_for_words(targets, "CBOW5_hotel_fixed.pt", topn=5)
for w, nbrs in neighbors.items():
    print(f"\n{w}:")
    for n, s in nbrs:
        print(f"  {n}\t{s:.4f}")


print("CBOW2_sci-fi")
neighbors = nearest_neighbors_for_words(targets, "CBOW2_scifi_fixed.pt", topn=5)
for w, nbrs in neighbors.items():
    print(f"\n{w}:")
    for n, s in nbrs:
        print(f"  {n}\t{s:.4f}")

CBOW2_hotel

man:
  guy	0.5501
  manager	0.5014
  chest	0.4790
  staff	0.4091
  waiter	0.4030

concierge:
  manager	0.5335
  call	0.5093
  desk	0.4466
  professional	0.4307
  attentive	0.4205
CBOW5_hotel

man:
  guy	0.5131
  copi	0.4208
  humane	0.3992
  mislead	0.3864
  gentleman	0.3596

concierge:
  interact	0.3936
  breathtake	0.3743
  shelf	0.3610
  tujague	0.3604
  memorable	0.3534
CBOW2_sci-fi

man:
  people	0.6829
  one	0.6451
  terran	0.6443
  person	0.6443
  you	0.6307

concierge:
  necesssary	0.4105
  humbug	0.4065
  mercenariness	0.4006
  helpfulness	0.3892
  fflhffier	0.3883


The nearest neighbors differ. In the hotel-review embeddings, "man" is encoded as someone you meet at a hotel, with neighbors such as guy, manager, staff, waiter and gentleman, whereas the sci-fi embeddings place it next to more general terms such as people and person. For "concierge" the difference is even more extreme. The hotel-review CBOW2 model knows the term and has mostly sensible neighbors (manager, desk, professional, attentive), while the sci-fi model has no clue what the word means and returns essentially arbitrary neighbors.
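
One way to make "the neighbors differ" concrete is the Jaccard overlap between two neighbor lists (size of the intersection divided by size of the union). The sketch below uses the neighbors of "man" printed above; `jaccard` is a helper introduced only for this illustration.

```python
def jaccard(a, b):
    """Jaccard overlap of two word lists: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Neighbor lists for "man", copied from the outputs above.
man_hotel_cbow2 = ["guy", "manager", "chest", "staff", "waiter"]
man_hotel_cbow5 = ["guy", "copi", "humane", "mislead", "gentleman"]
man_scifi_cbow2 = ["people", "one", "terran", "person", "you"]

print("hotel CBOW2 vs sci-fi CBOW2:", jaccard(man_hotel_cbow2, man_scifi_cbow2))
print("hotel CBOW2 vs hotel CBOW5:", round(jaccard(man_hotel_cbow2, man_hotel_cbow5), 3))
```

For "man", the two corpora share no neighbor at all, while the two hotel models share only "guy".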

### 2.5 🗒❓ What are the differences between CBOW2 and CBOW5 ? Can you "describe" them?    

Comparing the most similar words returned for the two context sizes, the following trend can be observed:
CBOW2 either picks really good words or fails catastrophically. With CBOW5, the results tend to be less strictly related to the query word, but catastrophically wrong neighbors occur less often. CBOW5 seems to learn the general area of a word and picks words from that area, so it is consistently good but never great. CBOW2 tends to learn only the words that are very closely connected to the query word, so it is either spot on or way off; it is occasionally great.
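
One mechanical difference that may contribute to these observations: with the windowing used in `get_bag_of_words`, a sequence of *n* tokens yields `max(0, n - 2 * context_size)` training pairs, so a larger window drops more words at the edges and skips short reviews entirely. A sketch with made-up review lengths (the lengths are illustrative, not measured on the dataset; `num_pairs` is a hypothetical helper):

```python
def num_pairs(num_tokens, context_size):
    """Number of (context, target) pairs produced by range(c, n - c)."""
    return max(0, num_tokens - 2 * context_size)

toy_review_lengths = [4, 8, 12, 40]  # hypothetical token counts per review

for c in (2, 5):
    per_review = [num_pairs(n, c) for n in toy_review_lengths]
    print(f"context_size={c}: pairs per review={per_review}, total={sum(per_review)}")
```

Whether this effect is large on the hotel reviews depends on the actual length distribution, which is not examined here.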

### Report
The lab report should contain a detailed description of the approaches you have used to solve this exercise. Please also include results.

Answers for the questions marked 🗒❓ go here as well

❓ Describe your decisions for preprocessing the datasets

- Lowercasing: avoids tokens that differ only in case. Without lowercasing, "Like" and "like" would be different tokens, and their meaning would have to be learned separately. That would not be a problem with a really big dataset, but with our toy-sized dataset it makes a difference.
- Punctuation removal: similar reasoning. Punctuation marks would otherwise be included in the vocabulary, resulting in more entries, as above.
- Lemmatization: same idea of reducing the size of the vocabulary.
- Chunking: the sci-fi dataset is first split into long chunks so that the same preprocessing as for the other dataset can be used. The chunks are kept long so that the least amount of text is truncated by the sliding window.
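
The effect of the first two decisions on vocabulary size can be seen on a made-up sentence. This sketch uses plain string operations instead of the spaCy pipeline from the notebook, so it covers lowercasing and punctuation removal only, not lemmatization; `vocab_size` here is a local helper, not the notebook variable of the same name.

```python
import string

raw = "Like the room. I like the room, like REALLY like the staff!"

def vocab_size(tokens):
    """Number of distinct token types."""
    return len(set(tokens))

# 1) whitespace split only: case and attached punctuation create extra types
plain = raw.split()

# 2) lowercase everything
lowered = [t.lower() for t in plain]

# 3) additionally strip leading/trailing punctuation
stripped = [t.strip(string.punctuation) for t in lowered]
stripped = [t for t in stripped if t]  # drop tokens that were pure punctuation

print("raw tokens:        ", vocab_size(plain))
print("+ lowercasing:     ", vocab_size(lowered))
print("+ punctuation off: ", vocab_size(stripped))
```

Each step merges surface variants of the same word, so the remaining types are seen more often during training.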

❓ Are predictions made by the model sensitive towards the context size?
Yes, as described in 2.5:
Comparing the most similar words returned for the two context sizes, the following trend can be observed:
CBOW2 either picks really good words or fails catastrophically. With CBOW5, the results tend to be less strictly related to the query word, but catastrophically wrong neighbors occur less often. CBOW5 seems to learn the general area of a word and picks words from that area, so it is consistently good but never great. CBOW2 tends to learn only the words that are very closely connected to the query word, so it is either spot on or way off; it is occasionally great.