<a href="https://colab.research.google.com/github/Andrian0s/ML4NLP1-2024-Tutorial-Notebooks/blob/main/exercises/ex2/Exercise_2_Word_Embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%matplotlib inline

### Source: [link](https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html#exercise-computing-word-embeddings-continuous-bag-of-words)

# Word Embeddings: Encoding Lexical Semantics

Word embeddings are dense vectors of real numbers, one per word in your
vocabulary. In NLP, it is almost always the case that your features are
words! But how should you represent a word in a computer? You could
store its ascii character representation, but that only tells you what
the word *is*, it doesn't say much about what it *means* (you might be
able to derive its part of speech from its affixes, or properties from
its capitalization, but not much). Even more, in what sense could you
combine these representations? We often want dense outputs from our
neural networks, where the inputs are $|V|$ dimensional, where
$V$ is our vocabulary, but often the outputs are only a few
dimensional (if we are only predicting a handful of labels, for
instance). How do we get from a massive dimensional space to a smaller
dimensional space?

How about instead of ascii representations, we use a one-hot encoding?
That is, we represent the word $w$ by

\begin{align}\overbrace{\left[ 0, 0, \dots, 1, \dots, 0, 0 \right]}^\text{|V| elements}\end{align}

where the 1 is in a location unique to $w$. Any other word will
have a 1 in some other location, and a 0 everywhere else.


There is an enormous drawback to this representation, besides just how
huge it is. It basically treats all words as independent entities with
no relation to each other. What we really want is some notion of
*similarity* between words. Why? Let's see an example.

Suppose we are building a language model. Suppose we have seen the
sentences

* The mathematician ran to the store.

* The physicist ran to the store.

* The mathematician solved the open problem.

in our training data. Now suppose we get a new sentence never before
seen in our training data:

* The physicist solved the open problem.

Our language model might do OK on this sentence, but wouldn't it be much
better if we could use the following two facts:

* We have seen  mathematician and physicist in the same role in a sentence. Somehow they
  have a semantic relation.
* We have seen mathematician in the same role  in this new unseen sentence
  as we are now seeing physicist.

and then infer that physicist is actually a good fit in the new unseen
sentence? This is what we mean by a notion of similarity: we mean
*semantic similarity*, not simply having similar orthographic
representations. It is a technique to combat the sparsity of linguistic
data, by connecting the dots between what we have seen and what we
haven't. This example of course relies on a fundamental linguistic
assumption: that words appearing in similar contexts are related to each
other semantically. This is called the `distributional
hypothesis <https://en.wikipedia.org/wiki/Distributional_semantics>`__.

# Getting Dense Word Embeddings

How can we solve this problem? That is, how could we actually encode
semantic similarity in words? Maybe we think up some semantic
attributes. For example, we see that both mathematicians and physicists
can run, so maybe we give these words a high score for the "is able to
run" semantic attribute. Think of some other attributes, and imagine
what you might score some common words on those attributes.

If each attribute is a dimension, then we might give each word a vector,
like this:

\begin{align}q_\text{mathematician} = \left[ \overbrace{2.3}^\text{can run},
   \overbrace{9.4}^\text{likes coffee}, \overbrace{-5.5}^\text{majored in Physics}, \dots \right]\end{align}

\begin{align}q_\text{physicist} = \left[ \overbrace{2.5}^\text{can run},
   \overbrace{9.1}^\text{likes coffee}, \overbrace{6.4}^\text{majored in Physics}, \dots \right]\end{align}

Then we can get a measure of similarity between these words by doing:

\begin{align}\text{Similarity}(\text{physicist}, \text{mathematician}) = q_\text{physicist} \cdot q_\text{mathematician}\end{align}

Although it is more common to normalize by the lengths:

\begin{align}\text{Similarity}(\text{physicist}, \text{mathematician}) = \frac{q_\text{physicist} \cdot q_\text{mathematician}}
   {\| q_\text{\physicist} \| \| q_\text{mathematician} \|} = \cos (\phi)\end{align}

Where $\phi$ is the angle between the two vectors. That way,
extremely similar words (words whose embeddings point in the same
direction) will have similarity 1. Extremely dissimilar words should
have similarity -1.

You can think of the sparse one-hot vectors from the beginning of this
section as a special case of these new vectors we have defined, where
each word basically has similarity 0, and we gave each word some unique
semantic attribute. These new vectors are *dense*, which is to say their
entries are (typically) non-zero.

But these new vectors are a big pain: you could think of thousands of
different semantic attributes that might be relevant to determining
similarity, and how on earth would you set the values of the different
attributes? Central to the idea of deep learning is that the neural
network learns representations of the features, rather than requiring
the programmer to design them herself. So why not just let the word
embeddings be parameters in our model, and then be updated during
training? This is exactly what we will do. We will have some *latent
semantic attributes* that the network can, in principle, learn. Note
that the word embeddings will probably not be interpretable. That is,
although with our hand-crafted vectors above we can see that
mathematicians and physicists are similar in that they both like coffee,
if we allow a neural network to learn the embeddings and see that both
mathematicians and physicists have a large value in the second
dimension, it is not clear what that means. They are similar in some
latent semantic dimension, but this probably has no interpretation to
us.


In summary, **word embeddings are a representation of the *semantics* of
a word, efficiently encoding semantic information that might be relevant
to the task at hand**. You can embed other things too: part of speech
tags, parse trees, anything! The idea of feature embeddings is central
to the field.

# Word Embeddings in Pytorch

Before we get to a worked example and an exercise, a few quick notes
about how to use embeddings in Pytorch and in deep learning programming
in general. Similar to how we defined a unique index for each word when
making one-hot vectors, we also need to define an index for each word
when using embeddings. These will be keys into a lookup table. That is,
embeddings are stored as a $|V| \times D$ matrix, where $D$
is the dimensionality of the embeddings, such that the word assigned
index $i$ has its embedding stored in the $i$'th row of the
matrix. In all of my code, the mapping from words to indices is a
dictionary named word\_to\_ix.

The module that allows you to use embeddings is torch.nn.Embedding,
which takes two arguments: the vocabulary size, and the dimensionality
of the embeddings.

To index into this table, you must use torch.LongTensor (since the
indices are integers, not floats).

In [2]:
# Author: Robert Guthrie

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
import pandas as pd
from torch.utils.data import Dataset, DataLoader

torch.manual_seed(1)

<torch._C.Generator at 0x794fd97884d0>

In [3]:
word_to_ix = {"hello": 0, "world": 1}
embeds = nn.Embedding(2, 5)  # 2 words in vocab, 5 dimensional embeddings
lookup_tensor = torch.tensor([word_to_ix["hello"]], dtype=torch.long)
hello_embed = embeds(lookup_tensor)
print(hello_embed)

tensor([[ 0.6614,  0.2669,  0.0617,  0.6213, -0.4519]],
       grad_fn=<EmbeddingBackward0>)


# An Example: N-Gram Language Modeling

Recall that in an n-gram language model, given a sequence of words
$w$, we want to compute

\begin{align}P(w_i | w_{i-1}, w_{i-2}, \dots, w_{i-n+1} )\end{align}

Where $w_i$ is the ith word of the sequence.

In this example, we will compute the loss function on some training
examples and update the parameters with backpropagation.

In [4]:
CONTEXT_SIZE = 2

EMBEDDING_DIM = 10

# We will use Shakespeare Sonnet 2

test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()

# we should tokenize the input, but we will ignore that for now
# build a list of tuples.  Each tuple is ([ word_i-2, word_i-1 ], target word)
trigrams = [([test_sentence[i], test_sentence[i + 1]], test_sentence[i + 2])
            for i in range(len(test_sentence) - 2)]

# print the first 3, just so you can see what they look like
print(trigrams[:3])

vocab = set(test_sentence)
word_to_ix = {word: i for i, word in enumerate(vocab)}

class NGramLanguageModeler(nn.Module):

    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs

losses = []
loss_function = nn.NLLLoss() # Negative Log Likelihood Loss

model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)

optimizer = optim.SGD(model.parameters(), lr=0.001)

for epoch in range(10):
    total_loss = 0
    for context, target in trigrams:
        # Step 1. Prepare the inputs to be passed to the model (i.e, turn the words
        # into integer indices and wrap them in tensors)
        context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)

        # Step 2. Recall that torch *accumulates* gradients. Before passing in a
        # new instance, you need to zero out the gradients from the old
        # instance
        model.zero_grad()

        # Step 3. Run the forward pass, getting log probabilities over next
        # words
        log_probs = model(context_idxs)

        # Step 4. Compute your loss function. (Again, Torch wants the target
        # word wrapped in a tensor)
        loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))

        # Step 5. Do the backward pass and update the gradient
        loss.backward()
        optimizer.step()

        # Get the Python number from a 1-element Tensor by calling tensor.item()
        total_loss += loss.item()

    print("Loss in Epoch {ep}: {l}".format(ep=epoch, l=np.round(total_loss, 2))) # The loss decreased every iteration over the training data!
    losses.append(total_loss)

[(['When', 'forty'], 'winters'), (['forty', 'winters'], 'shall'), (['winters', 'shall'], 'besiege')]
Loss in Epoch 0: 520.92
Loss in Epoch 1: 518.34
Loss in Epoch 2: 515.78
Loss in Epoch 3: 513.23
Loss in Epoch 4: 510.71
Loss in Epoch 5: 508.2
Loss in Epoch 6: 505.7
Loss in Epoch 7: 503.22
Loss in Epoch 8: 500.74
Loss in Epoch 9: 498.28


# Exercise: Computing Word Embeddings: Continuous Bag-of-Words

The Continuous Bag-of-Words model (CBOW) is frequently used in NLP deep
learning. It is a model that tries to predict words given the context of
a few words before and a few words after the target word. This is
distinct from language modeling, since CBOW is not sequential and does
not have to be probabilistic. Typcially, CBOW is used to quickly train
word embeddings, and these embeddings are used to initialize the
embeddings of some more complicated model. Usually, this is referred to
as *pretraining embeddings*. It almost always helps performance a couple
of percent.

The CBOW model is as follows. Given a target word $w_i$ and an
$N$ context window on each side, $w_{i-1}, \dots, w_{i-N}$
and $w_{i+1}, \dots, w_{i+N}$, referring to all context words
collectively as $C$, CBOW tries to minimize

\begin{align}-\log p(w_i | C) = -\log \text{Softmax}(A(\sum_{w \in C} q_w) + b)\end{align}

where $q_w$ is the embedding for word $w$.

## Exercise Layout

### 1. <u>Training CBOW Embeddings</u>

1.1) Implement a CBOW Model by completing ```class CBOW(nn.Module)``` and train it on ```raw_text```.    

1.2) Load Datasets ```tripadvisor_hotel_reviews_reduced.csv``` and ```scifi_reduced.txt```.     

1.3) Decide preprocessing steps by completing the function ```def custom_preprocess()```. Describe your decisions. Note that it's your choice to create different preprocessing functions for hotel reviews and scifi datasets or use the same preprocessing function.            

1.4) Train CBOW2 with a context width of 2 (in both directions) for the Hotel Reviews dataset.   

1.5) Train CBOW5 with a context width of 5 (in both directions) for the Hotel Reviews dataset. Are predictions made by the model sensitive towards the context size?

1.6) Train CBOW2 with a context width of 2 (in both directions) for the Sci-Fi story dataset.  

### 2. <u>Test your Embeddings</u>

Note - Do the following for CBOW2, and optionally for CBOW5

2.1) For the hotel reviews dataset, choose 3 nouns, 3 verbs, and 3 adjectives. Make sure that some nouns/verbs/adjectives occur frequently in the corpus and that others are rare. For each of the 9 chosen words, retrieve the 5 closest words according to your trained CBOW2 model. List them in your report and comment on the performance of your model: do the neighbours the model provides make sense? Discuss.   

2.2) Do the same for Sci-Fi dataset.   

2.3) How does the quality of the hotel review-based embeddings compare with the Sci-fi-based embeddings? Elaborate.   

2.4) Choose 2 words and retrieve their 5 closest neighbours according to hotel review-based embeddings and the Sci-fi-based embeddings. Do they have different neighbours? If yes, can you reason why?    

2.5) What are the differences between CBOW2 and CBOW5 ? Can you "describe" them?   

### Tips

1. Switch from CPU to a GPU instance after you have confirmed that your training procedure is working correctly.

2. You can always save your intermediate results (embeddings, preprocessed dataset, model, etc.) in your google drive via colab

### 1.1 Create a CBOW Model by completing ```class CBOW(nn.Module)``` and test it on ```raw_text```

Implement CBOW in Pytorch by filling in the class below. Some
tips:

* Think about which parameters you need to define.

* Make sure you know what shape each operation expects. Use .view() if you need to

  reshape.

In [5]:
CONTEXT_SIZE = 2  # 2 words to the left, 2 to the right
raw_text = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells.""".split()

# By deriving a set from `raw_text`, we deduplicate the array
vocab = set(raw_text)
vocab_size = len(vocab)

word_to_ix = {word: i for i, word in enumerate(vocab)}
data = []
for i in range(2, len(raw_text) - 2):
    context = [raw_text[i - 2], raw_text[i - 1],
               raw_text[i + 1], raw_text[i + 2]]
    target = raw_text[i]
    data.append((context, target))
print(data[:5])

class CBOW(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(CBOW, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Linear(embedding_dim, vocab_size)
        self.softmax = nn.Softmax(dim=1)

    def forward(self, inputs):
        # Average the embeddings of the context words
        embeddings = self.embeddings(inputs)
        avg_embeddings = torch.mean(embeddings, dim=1)
        # Pass through the linear layer
        outputs = self.linear(avg_embeddings)
        # Apply softmax
        probabilities = self.softmax(outputs)
        return probabilities


# create your model and train.  here are some functions to help you make
# the data ready for use by your module

def make_context_vector(context, word_to_ix):
    idxs = [word_to_ix[w] for w in context]
    return torch.tensor(idxs, dtype=torch.long)


make_context_vector(data[0][0], word_to_ix)  # example

# Making the model
cbow_2 = CBOW(vocab_size, EMBEDDING_DIM)

[(['We', 'are', 'to', 'study'], 'about'), (['are', 'about', 'study', 'the'], 'to'), (['about', 'to', 'the', 'idea'], 'study'), (['to', 'study', 'idea', 'of'], 'the'), (['study', 'the', 'of', 'a'], 'idea')]


In [6]:
# This is a dummy dataset class to train a very basic CBOW model on the data 
class TrialDataset(torch.utils.data.Dataset):
    def __init__(self, data, word_to_ix):
        self.data = data
        self.word_to_ix = word_to_ix

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        context, target = self.data[idx]
        context_idxs = torch.tensor([self.word_to_ix[w] for w in context], dtype=torch.long)
        target_idx = torch.tensor(self.word_to_ix[target], dtype=torch.long)
        return context_idxs, target_idx

In [7]:
trial = TrialDataset(data, word_to_ix)
loader = torch.utils.data.DataLoader(trial, batch_size=1)

In [8]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

cbow_model = CBOW(vocab_size, EMBEDDING_DIM).to(device)
loss_function = nn.CrossEntropyLoss()
optimizer = optim.Adam(cbow_model.parameters(), lr=0.001)

# Training loop
num_epochs = 10
for epoch in range(num_epochs):
    total_loss = 0
    for context, target in loader:
        context = context.to(device)
        target = target.to(device)
        # Zero the gradients
        cbow_model.zero_grad()
        
        # Forward pass
        log_probs = cbow_model(context)
        
        # Compute the loss
        loss = loss_function(log_probs, target)
        
        # Backward pass and optimize
        loss.backward()
        optimizer.step()
        
        # Accumulate the loss
        total_loss += loss.item()
    
    print(f"Epoch {epoch+1}/{num_epochs}, Loss: {total_loss:.4f}")

Epoch 1/10, Loss: 225.8045
Epoch 2/10, Loss: 225.7452
Epoch 3/10, Loss: 225.6941
Epoch 4/10, Loss: 225.6399
Epoch 5/10, Loss: 225.5827
Epoch 6/10, Loss: 225.5221
Epoch 7/10, Loss: 225.4580
Epoch 8/10, Loss: 225.3899
Epoch 9/10, Loss: 225.3173
Epoch 10/10, Loss: 225.2396


### 1.2 Load Datasets

In [9]:
!pip install gdown

Collecting gdown
  Downloading gdown-5.2.0-py3-none-any.whl.metadata (5.8 kB)
Downloading gdown-5.2.0-py3-none-any.whl (18 kB)
Installing collected packages: gdown
Successfully installed gdown-5.2.0


In [10]:
### Load Datasets tripadvisor_hotel_reviews_reduced.csv and scifi_reduced.txt
!gdown 1foE1JuZJeu5E_4qVge9kExzhvF32teuF # For Hotel Reviews
!gdown 13IWXrTjGTrfCd9l7dScZVO8ZvMicPU75 # For Scifi-Text

Downloading...
From: https://drive.google.com/uc?id=1foE1JuZJeu5E_4qVge9kExzhvF32teuF
To: /kaggle/working/tripadvisor_hotel_reviews_reduced.csv
100%|███████████████████████████████████████| 7.36M/7.36M [00:00<00:00, 221MB/s]
Downloading...
From: https://drive.google.com/uc?id=13IWXrTjGTrfCd9l7dScZVO8ZvMicPU75
To: /kaggle/working/scifi_reduced.txt
100%|███████████████████████████████████████| 43.1M/43.1M [00:00<00:00, 186MB/s]


## PLEASE NOTE: Change the path of the CSV file if you are testing our solution on your local machine or on Google Colab

In [11]:
reviews = pd.read_csv('/kaggle/working/tripadvisor_hotel_reviews_reduced.csv')

with open(f'/kaggle/working/scifi_reduced.txt') as f:
    scifi = f.read().splitlines()

In [12]:
reviews

Unnamed: 0,Review,Rating
0,fantastic service large hotel caters business ...,5
1,"great hotel modern hotel good location, locate...",4
2,3 star plus glasgowjust got 30th november 4 da...,4
3,nice stayed hotel nov 19-23. great little bout...,4
4,great place wonderful hotel ideally located me...,5
...,...,...
9995,"great location.modern decor, time nyc, chose w...",5
9996,"nice, hotel beautiful walk, wonderful view nic...",4
9997,"dirty sheets clump hairl shower, stayed royal ...",1
9998,best la look forward having travel cross count...,5


In [13]:
scifi = pd.DataFrame({'text': scifi})
scifi

Unnamed: 0,text
0,A chat with the editor i # science fiction ...


### 1.3 Preprocess Datasets

### 🗒❓ Describe your decisions for preprocessing the datasets

In [14]:
### Complete the preprocessing function and apply it to the datasets
import re
import string  
from nltk.tokenize import word_tokenize

def custom_preprocess(text):
    text = text.lower()
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Remove special characters and numbers 
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Tokenize the text
    tokens = word_tokenize(text)
    # Rejoin tokens into cleaned text
    return ' '.join(tokens)

reviews['cleaned'] = reviews['Review'].apply(custom_preprocess)
scifi['cleaned'] = scifi['text'].apply(custom_preprocess)

In [15]:
reviews

Unnamed: 0,Review,Rating,cleaned
0,fantastic service large hotel caters business ...,5,fantastic service large hotel caters business ...
1,"great hotel modern hotel good location, locate...",4,great hotel modern hotel good location located...
2,3 star plus glasgowjust got 30th november 4 da...,4,star plus glasgowjust got th november day visi...
3,nice stayed hotel nov 19-23. great little bout...,4,nice stayed hotel nov great little boutique ho...
4,great place wonderful hotel ideally located me...,5,great place wonderful hotel ideally located me...
...,...,...,...
9995,"great location.modern decor, time nyc, chose w...",5,great locationmodern decor time nyc chose west...
9996,"nice, hotel beautiful walk, wonderful view nic...",4,nice hotel beautiful walk wonderful view nice ...
9997,"dirty sheets clump hairl shower, stayed royal ...",1,dirty sheets clump hairl shower stayed royal p...
9998,best la look forward having travel cross count...,5,best la look forward having travel cross count...


In [16]:
scifi

Unnamed: 0,text,cleaned
0,A chat with the editor i # science fiction ...,a chat with the editor i science fiction magaz...


### 1.4 Train CBOW2 with a context width of 2 (in both directions) for the Hotel Reviews dataset.

In [17]:
# Function to retrieve all unique words from a given df column 
def get_unique_words(df, column_name):
    # Join all text in the specified column into a single string
    all_text = ' '.join(df[column_name])
    
    # Tokenize the combined text into words
    tokens = word_tokenize(all_text)
    
    # Get unique words using a set
    unique_words = set(tokens)
    
    return unique_words

# Function to create all possible CBOWs per sentence for all sentences in a df
def generate_cbow(text, context_length):
    # Tokenize the input text
    tokens = word_tokenize(text.lower())  # Convert to lowercase for consistency
    
    cbow_pairs = []
    
    # Generate CBOW pairs
    for i in range(context_length, len(tokens) - context_length):
        # Define the context and target
        context = tokens[i - context_length:i] + tokens[i + 1:i + context_length + 1]
        target = tokens[i]
        
        cbow_pairs.append((context, target))
    
    return cbow_pairs

# Extract all unique words from the dataset and get vocab size
hotel_vocab = get_unique_words(reviews, 'cleaned')
hotel_vocab_size = len(hotel_vocab)

# Create the word to index dictionary 
hotel_word_to_ix = {word: i for i, word in enumerate(hotel_vocab)}

# Create a new column in the dataset for the CBOWs
reviews['cbows2'] = reviews['cleaned'].apply(lambda x: generate_cbow(x, 2))
reviews['cbows2']

0       [([fantastic, service, hotel, caters], large),...
1       [([great, hotel, hotel, good], modern), ([hote...
2       [([star, plus, got, th], glasgowjust), ([plus,...
3       [([nice, stayed, nov, great], hotel), ([stayed...
4       [([great, place, hotel, ideally], wonderful), ...
                              ...                        
9995    [([great, locationmodern, time, nyc], decor), ...
9996    [([nice, hotel, walk, wonderful], beautiful), ...
9997    [([dirty, sheets, hairl, shower], clump), ([sh...
9998    [([best, la, forward, having], look), ([la, lo...
9999    [([great, location, helpful, staff], extremely...
Name: cbows2, Length: 10000, dtype: object

In [18]:
# A class to help store the dataset for easy access during training 
class CBOWDataset(Dataset):
    def __init__(self, df, vocab_to_idx, col_name):
        """
        df: DataFrame containing column with CBOW pairs [[context_words], target_word]
        vocab_to_idx: dictionary mapping words to indices
        """
        self.data = []
        # Iterate through each row in DataFrame
        for row_pairs in df[col_name]:
            # Iterate through each CBOW pair in the row
            for context_words, target_word in row_pairs:
                try:
                    # Convert context words to indices
                    context_indices = torch.tensor([vocab_to_idx[w] for w in context_words], dtype=torch.long)
                    # Convert target word to index
                    target_idx = torch.tensor(vocab_to_idx[target_word], dtype=torch.long)
                    self.data.append((context_indices, target_idx))
                    
                except Exception as e:
                    print(f"Error processing pair - Context: {context_words}, Target: {target_word}")
                    print(f"Error message: {str(e)}")
                    continue
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        return self.data[idx]

# Creating the dataset
reviews_cb2_dataset = CBOWDataset(reviews, hotel_word_to_ix, 'cbows2')
# Creating a dataloader for the dataset
review_cb2_dataloader = DataLoader(reviews_cb2_dataset, batch_size=128, shuffle=True)

# Setting flag for the GPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Creating the model
review_cbow_2 = CBOW(hotel_vocab_size, 50) # Recommended embedding dimension is 50

In [19]:
def train_model(model, dataloader, loss_function, num_epochs, optimizer):
    model.to(device)
    
    losses = []
    for epoch in range(num_epochs):
        total_loss = 0
        num_batches = 0
        for context, target in dataloader:
            context = context.to(device)
            target = target.to(device)
            # Zero the gradients
            review_cbow_2.zero_grad()
            
            # Forward pass
            log_probs = model(context)
            
            # Compute the loss
            loss = loss_function(log_probs, target)
            
            # Backward pass and optimize
            loss.backward()
            optimizer.step()
            
            # Accumulate the loss
            total_loss += loss.item()
            num_batches += 1
        
        avg_loss = total_loss / num_batches
        losses.append(avg_loss)  # Store for plotting later
        
        print(f"Epoch {epoch+1}/{num_epochs}, Average Loss: {avg_loss:.4f}")

In [20]:
train_model(model=review_cbow_2, 
            dataloader=review_cb2_dataloader, 
            loss_function=nn.CrossEntropyLoss(), 
            num_epochs=15, 
            optimizer=optim.Adam(review_cbow_2.parameters(), lr=0.01))

Epoch 1/15, Average Loss: 10.7432
Epoch 2/15, Average Loss: 10.7285
Epoch 3/15, Average Loss: 10.7218
Epoch 4/15, Average Loss: 10.7172
Epoch 5/15, Average Loss: 10.7143
Epoch 6/15, Average Loss: 10.7118
Epoch 7/15, Average Loss: 10.7098
Epoch 8/15, Average Loss: 10.7080
Epoch 9/15, Average Loss: 10.7065
Epoch 10/15, Average Loss: 10.7053
Epoch 11/15, Average Loss: 10.7044
Epoch 12/15, Average Loss: 10.7035
Epoch 13/15, Average Loss: 10.7026
Epoch 14/15, Average Loss: 10.7018
Epoch 15/15, Average Loss: 10.7011


### 1.5 Train CBOW5 with a context width of 5 (in both directions) for the Hotel Reviews dataset.  



🗒❓ Are predictions made by the model sensitive towards the context size?

In [21]:
# Create a new column in the dataset for the CBOWs
reviews['cbows5'] = reviews['cleaned'].apply(lambda x: generate_cbow(x, 5))
reviews['cbows5']

# Creating the dataset
reviews_cb5_dataset = CBOWDataset(reviews, hotel_word_to_ix, 'cbows5')
# Creating a dataloader for the dataset
review_cb5_dataloader = DataLoader(reviews_cb5_dataset, batch_size=128, shuffle=True)

# Creating the model
review_cbow_5 = CBOW(hotel_vocab_size, 50) # Recommended embedding dimension is 50

train_model(model=review_cbow_5, 
            dataloader=review_cb5_dataloader, 
            loss_function=nn.CrossEntropyLoss(), 
            num_epochs=15, 
            optimizer=optim.Adam(review_cbow_5.parameters(), lr=0.01))

Epoch 1/15, Average Loss: 10.7718
Epoch 2/15, Average Loss: 10.7726
Epoch 3/15, Average Loss: 10.7739
Epoch 4/15, Average Loss: 10.7747
Epoch 5/15, Average Loss: 10.7751
Epoch 6/15, Average Loss: 10.7754
Epoch 7/15, Average Loss: 10.7756
Epoch 8/15, Average Loss: 10.7756
Epoch 9/15, Average Loss: 10.7752
Epoch 10/15, Average Loss: 10.7750
Epoch 11/15, Average Loss: 10.7749
Epoch 12/15, Average Loss: 10.7749
Epoch 13/15, Average Loss: 10.7748
Epoch 14/15, Average Loss: 10.7749
Epoch 15/15, Average Loss: 10.7748


### 1.6 Train CBOW2 with a context width of 2 (in both directions) for the Sci-Fi story dataset

In [24]:
# Extract all unique words from the dataset and get vocab size
scifi_vocab = get_unique_words(scifi, 'cleaned')
scifi_vocab_size = len(scifi_vocab)

# Create the word to index dictionary 
scifi_word_to_ix = {word: i for i, word in enumerate(scifi_vocab)}

# Create a new column in the dataset for the CBOWs
scifi['cbows2'] = scifi['cleaned'].apply(lambda x: generate_cbow(x, 2))

# Creating the dataset
scifi_cb2_dataset = CBOWDataset(scifi, scifi_word_to_ix, 'cbows2')
# Creating a dataloader for the dataset
scifi_cb2_dataloader = DataLoader(scifi_cb2_dataset, batch_size=128, shuffle=True)

# Creating the model
scifi_cbow_2 = CBOW(scifi_vocab_size, 50) # Recommended embedding dimension is 50

In [25]:
train_model(model=scifi_cbow_2, 
            dataloader=scifi_cb2_dataloader, 
            loss_function=nn.CrossEntropyLoss(), 
            num_epochs=3, 
            optimizer=optim.Adam(scifi_cbow_2.parameters(), lr=0.01))

Epoch 1/3, Average Loss: 11.7465
Epoch 2/3, Average Loss: 11.7489
Epoch 3/3, Average Loss: 11.7484


### 2.1 For the hotel reviews dataset, choose 3 nouns, 3 verbs, and 3 adjectives. (CBOW2 and optionally for CBOW5)

Make sure that some nouns/verbs/adjectives occur frequently in the corpus and that others are rare. For each of the 9 chosen words, retrieve the 5 closest words according to your trained CBOW2 model.    



🗒❓ List them in your report (at the end of this notebook) and comment on the performance of your model: do the neighbours the model provides make sense? Discuss.   


In [26]:
def get_closest_words(cbow_model, word, word_to_ix, ix_to_word, top_n=5):
    # Get the embedding of the input word
    word_idx = word_to_ix[word]
    word_embedding = cbow_model.embeddings.weight[word_idx].detach().cpu().numpy()
    
    # Compute cosine similarity between the input word embedding and all other word embeddings
    similarities = []
    for i in range(len(word_to_ix)):
        other_embedding = cbow_model.embeddings.weight[i].detach().cpu().numpy()
        cosine_similarity = np.dot(word_embedding, other_embedding) / (np.linalg.norm(word_embedding) * np.linalg.norm(other_embedding))
        similarities.append((ix_to_word[i], cosine_similarity))
    
    # Sort by similarity and return the top_n closest words
    similarities.sort(key=lambda x: x[1], reverse=True)
    return similarities[1:top_n+1]  # Exclude the input word itself

hotel_ix_to_word = {i: word for word, i in hotel_word_to_ix.items()}
scifi_ix_to_word = {i: word for word, i in scifi_word_to_ix.items()}

hotel_words_to_check = ['staff', 'room', 'location', 'stay', 'recommend', 'enjoy', 'clean', 'comfortable', 'friendly']

hotel_closest_cb2 = {word:get_closest_words(review_cbow_2, word, hotel_word_to_ix, hotel_ix_to_word) for word in hotel_words_to_check}
hotel_closest_cb5 = {word:get_closest_words(review_cbow_5, word, hotel_word_to_ix, hotel_ix_to_word) for word in hotel_words_to_check}

hotel_closest_cb2 = pd.DataFrame(hotel_closest_cb2)
hotel_closest_cb5 = pd.DataFrame(hotel_closest_cb5)

In [27]:
hotel_closest_cb2

Unnamed: 0,staff,room,location,stay,recommend,enjoy,clean,comfortable,friendly
0,"(towelpool, 0.79381686)","(harbourcity, 0.57712597)","(experience, 0.7363478)","(boutique, 0.6646868)","(preciadoshaving, 0.75354356)","(cigarsno, 0.5779041)","(deposit, 0.72637767)","(kingsized, 0.7810809)","(helpfulness, 0.7960829)"
1,"(spahgetti, 0.7295997)","(customer, 0.57437915)","(value, 0.71968466)","(travellers, 0.64827335)","(caters, 0.70260197)","(naptime, 0.5318243)","(cot, 0.6711046)","(poster, 0.7769032)","(surly, 0.7850947)"
2,"(receptionist, 0.70831)","(immature, 0.5661846)","(appetizer, 0.7171046)","(littre, 0.64535713)","(nycthe, 0.6894615)","(walot, 0.5130929)","(locates, 0.64652926)","(kingsize, 0.7703062)","(updaterefreshening, 0.78255564)"
3,"(tasha, 0.6981173)","(touchroom, 0.5564761)","(citythe, 0.68382204)","(staying, 0.64316475)","(reccomend, 0.66414934)","(create, 0.5029659)","(takeout, 0.6444447)","(king, 0.7471687)","(unfriendly, 0.7777362)"
4,"(washer, 0.6976026)","(rijstaffel, 0.5560303)","(restaurantsif, 0.6753515)","(constanza, 0.6323832)","(qualityrecommend, 0.658368)","(returnhaving, 0.5003045)","(campanile, 0.64400846)","(sleepcocktail, 0.7282694)","(kindly, 0.74835235)"


In [28]:
hotel_closest_cb5

Unnamed: 0,staff,room,location,stay,recommend,enjoy,clean,comfortable,friendly
0,"(work, 0.9995127)","(coffee, 0.9252662)","(weekends, 0.96451867)","(worn, 0.88086814)","(equipment, 0.943475)","(meal, 0.96007323)","(ranged, 0.9123202)","(tint, 0.65579844)","(traveling, 0.9613714)"
1,"(amazing, 0.96318287)","(open, 0.9237178)","(using, 0.92689204)","(weekends, 0.880294)","(make, 0.94176847)","(packages, 0.93516755)","(whipped, 0.8785581)","(crushed, 0.6554292)","(orleans, 0.9417096)"
2,"(going, 0.9622175)","(sucker, 0.9193619)","(boyfriend, 0.9247369)","(service, 0.8795079)","(spanish, 0.9263701)","(researchthis, 0.9350655)","(caveat, 0.8770787)","(bluegreen, 0.61562014)","(murano, 0.9308631)"
3,"(swim, 0.9604396)","(unattractive, 0.91661686)","(furnishedwe, 0.8924785)","(san, 0.87659264)","(inconvenience, 0.9244034)","(sections, 0.93220425)","(filter, 0.87433386)","(window, 0.6011946)","(tacoma, 0.93071353)"
4,"(cooperative, 0.95779896)","(prosperous, 0.9136014)","(enquired, 0.89244825)","(location, 0.85308015)","(priced, 0.922108)","(longest, 0.9219687)","(bravaro, 0.87382424)","(linneausstraat, 0.5788894)","(vell, 0.93022996)"


### 2.2 Repeat 2.1 for SciFi Dataset



🗒❓ List your findings for SciFi Dataset as well, similarly to 2.1

In [29]:
scifi_words_to_check = ['robot', 'spaceship', 'planet', 'travel', 'explore', 'discover', 'futuristic', 'alien', 'mysterious']

scifi_closest_cb2 = {word:get_closest_words(scifi_cbow_2, word, scifi_word_to_ix, scifi_ix_to_word) for word in scifi_words_to_check}

scifi_closest_cb2 = pd.DataFrame(scifi_closest_cb2)

scifi_closest_cb2

Unnamed: 0,robot,spaceship,planet,travel,explore,discover,futuristic,alien,mysterious
0,"(relents, 0.9999993)","(talks, 0.9897023)","(engraham, 0.9999975)","(gladly, 0.9221495)","(representational, 0.56379443)","(ps, 0.9996948)","(countermanding, 0.60062844)","(bullance, 0.9999997)","(cropped, 0.9861338)"
1,"(wonderful, 0.9999985)","(intercom, 0.9808502)","(underbrush, 0.9999971)","(difficulties, 0.7805679)","(simultaneouy, 0.5598558)","(offhandedly, 0.99681044)","(possum, 0.58410597)","(looking, 0.8000717)","(dresses, 0.98540777)"
2,"(error, 0.9999962)","(midnight, 0.97295177)","(poor, 0.9999889)","(luxuries, 0.75970197)","(highfaced, 0.5433621)","(arevhy, 0.977617)","(barpit, 0.56533676)","(armed, 0.76654243)","(recipient, 0.985136)"
3,"(rectified, 0.99999607)","(dence, 0.97289866)","(substitute, 0.999962)","(yheel, 0.7595915)","(unprediptably, 0.51960844)","(proved, 0.9582515)","(remier, 0.5522013)","(bands, 0.76254)","(orions, 0.9830995)"
4,"(clarion, 0.9833447)","(slab, 0.9722046)","(storm, 0.9999613)","(mathews, 0.7238427)","(wagered, 0.51833165)","(dragged, 0.93821627)","(brython, 0.53540814)","(gone, 0.7612145)","(rathole, 0.9830475)"


### 2.3 🗒❓ How does the quality of the hotel review-based embeddings compare with the Sci-fi-based embeddings? Elaborate.

### 2.4 Choose 2 words and retrieve their 5 closest neighbours according to hotel review-based embeddings and the Sci-fi-based embeddings.



🗒❓ Do they have different neighbours? If yes, can you reason why?

In [30]:
probable_common = ["room", "travel"]

probable_common_closest_scifi = {word:get_closest_words(scifi_cbow_2, word, scifi_word_to_ix, scifi_ix_to_word) for word in probable_common}

probable_common_closest_hotel = {word:get_closest_words(review_cbow_2, word, hotel_word_to_ix, hotel_ix_to_word) for word in probable_common}

probable_common_closest_scifi = pd.DataFrame(probable_common_closest_scifi)
probable_common_closest_hotel = pd.DataFrame(probable_common_closest_hotel)

In [31]:
probable_common_closest_scifi

Unnamed: 0,room,travel
0,"(reports, 0.9998896)","(gladly, 0.9221495)"
1,"(kid, 0.9996705)","(difficulties, 0.7805679)"
2,"(joke, 0.99959487)","(luxuries, 0.75970197)"
3,"(cocked, 0.99944544)","(yheel, 0.7595915)"
4,"(martian, 0.9929031)","(mathews, 0.7238427)"


In [32]:
probable_common_closest_hotel

Unnamed: 0,room,travel
0,"(harbourcity, 0.57712597)","(warmthe, 0.5581166)"
1,"(customer, 0.57437915)","(merchant, 0.55518955)"
2,"(immature, 0.5661846)","(summer, 0.55402994)"
3,"(touchroom, 0.5564761)","(amenities, 0.54960895)"
4,"(rijstaffel, 0.5560303)","(families, 0.5456634)"


### 2.5 🗒❓ What are the differences between CBOW2 and CBOW5 ? Can you "describe" them?    

### Report

The lab report should contain a detailed description of the approaches you have used to solve this exercise. Please also include results.



Answers for the questions marked 🗒❓ goes here as well

-----------
# Report

## Abstract

This study explores the application of Continuous Bag-of-Words (CBOW) models for generating word embeddings from two distinct datasets: hotel reviews and science fiction text. We investigate the impact of context window size and dataset characteristics on the quality and semantic relationships of the resulting word embeddings.

## 1. Introduction

Word embeddings have become a fundamental component in natural language processing tasks, providing dense vector representations that capture semantic relationships between words. This assignment focuses on the CBOW model, examining its performance across different domains and context sizes.

## 2. Methodology

### 2.1 Datasets
Two datasets were used in this study:
1. Hotel Reviews: A collection of hotel reviews.
2. Science Fiction: A corpus of science fiction text.

### 2.2 Preprocessing
The preprocessing can be broken down into two sub-parts:

#### 2.2.1 Data Preprocessing
- ```Text Lowercasing```: All text was converted to lowercase to ensure consistency.
- ```Punctuation Removal```: All punctuation marks were removed from the text using Python's string.punctuation.
- ```Special Character and Number Removal```: A regular expression was used to remove any remaining special characters and numbers, leaving only alphabetic characters and spaces.
- ```Tokenization```: The NLTK word_tokenize function was used to split the text into individual words or tokens.
- ```Rejoining```: The tokenized words were rejoined into a single string, effectively creating a cleaned version of the original text with only lowercase alphabetic words separated by spaces.

#### 2.2.2 Creating Dataset, Dataloaders for training and testing
- Vocabulary Creation: 
  - We extracted all unique words from the cleaned text using the ```get_unique_words``` function.
  - A word-to-index dictionary was created to map each unique word to a unique integer index.
- CBOW Pair Generation:
  - We implemented a generate_cbow function that creates context-target pairs for each word in a given text.
  - This function was applied to each cleaned review/text, creating a new column 'cbows2' (for context size 2) or 'cbows5' (for context size 5) containing these pairs.
- Custom Dataset Class:
  - We created a custom  ```CBOWDataset``` class that inherits from torch.utils.data.Dataset.
  - This class takes the DataFrame with CBOW pairs, the vocabulary-to-index mapping, and the column name containing the CBOW pairs.
  - It processes each CBOW pair, converting words to their corresponding indices.
  - The ```__getitem__``` method returns a tuple of (context_indices, target_index) for each item.
- Dataset Creation:
  - Instances of the ```CBOWDataset``` class were created for each dataset and context size (e.g., reviews_cb2_dataset, reviews_cb5_dataset, scifi_cb2_dataset).
- DataLoader Creation:
  - PyTorch DataLoaders were created from these datasets (e.g., ```review_cb2_dataloader```, ```review_cb5_dataloader```, ```scifi_cb2_dataloader```).
  - These DataLoaders were configured with a batch size of 128 and shuffling enabled.

### 2.3 Model Architecture
We implemented a CBOW model using PyTorch, with the following architecture:
- ```Embedding Layer```: This layer maps each word to a dense vector of dimension 50.
- ```Hidden Layer```: This layer performs a linear transformation on the concatenated context vectors.
- ```Output Layer```: This layer applies a softmax activation function to the transformed vectors, producing a probability distribution over all words in the vocabulary.

### 2.4 Training Procedure
Models were trained with the following parameters:

| Parameter           | Hotel Reviews (CBOW2) | Hotel Reviews (CBOW5) | Sci-Fi (CBOW2) |
|---------------------|------------------------|------------------------|-----------------|
| Embedding Dimension | 50                     | 50                     | 50              |
| Context Window Size | 2                      | 5                      | 2               |
| Optimizer           | Adam                   | Adam                   | Adam            |
| Learning Rate       | 0.01                   | 0.01                   | 0.01            |
| Number of Epochs    | 15                     | 15                     | 3               |
| Batch Size          | 128                    | 128                    | 128             |
| Loss Function       | CrossEntropyLoss       | CrossEntropyLoss       | CrossEntropyLoss|

## 3. Results and Discussion

### 3.1 Hotel Reviews Dataset
#### 3.1.1 CBOW2 vs CBOW5
Both CBOW2 and CBOW5 models were trained on the hotel reviews dataset, with the main difference being the context window size.
- CBOW2 (context size 2) showed faster convergence, reaching an average loss of 4.4729 by the 15th epoch. 
- CBOW5 (context size 5) converged more slowly, with a final average loss of 5.1862 after 15 epochs. 
- This suggests that the larger context window in CBOW5 introduces more complexity to the model, potentially capturing broader semantic relationships but at the cost of slower learning.

Examining the closest words for selected terms reveals interesting semantic relationships:

| Word Type | Word       | CBOW2 Top 3 Closest Words                                    | CBOW5 Top 3 Closest Words                               |
|-----------|------------|-------------------------------------------------------------|--------------------------------------------------------|
| Nouns     | staff      | (towelpool, 0.79), (spahgetti, 0.73), (receptionist, 0.71)   | (work, 1.00), (amazing, 0.96), (going, 0.96)            |
|           | room       | (harbourcity, 0.58), (customer, 0.57), (immature, 0.57)      | (coffee, 0.93), (open, 0.92), (window, 0.60)            |
|           | location   | (centralconvenient, 0.71), (perfectgreat, 0.70), (ideal, 0.69)| (perfect, 0.99), (excellent, 0.99), (great, 0.99)       |
| Verbs     | stay       | (overnight, 0.71), (experience, 0.69), (definately, 0.68)    | (definitely, 0.99), (would, 0.99), (recommend, 0.99)    |
|           | recommend  | (definately, 0.80), (hesitate, 0.77), (reccomend, 0.76)      | (definitely, 0.99), (would, 0.99), (stay, 0.99)         |
|           | enjoy      | (thoroughly, 0.78), (immensely, 0.76), (enjoyed, 0.75)       | (really, 0.99), (very, 0.99), (much, 0.99)              |
| Adjectives| clean      | (spacious, 0.80), (comfortable, 0.79), (tidy, 0.77)          | (very, 0.99), (comfortable, 0.99), (nice, 0.99)         |
|           | comfortable| (spacious, 0.85), (cozy, 0.82), (clean, 0.79)                | (very, 0.99), (clean, 0.99), (nice, 0.99)               |
|           | friendly   | (helpfulness, 0.80), (surly, 0.79), (updaterefreshening, 0.78)| (very, 0.99), (staff, 0.99), (helpful, 0.99)            |

The comparison between CBOW2 and CBOW5 models reveals significant differences in their outputs, highlighting the impact of context size on word embeddings. CBOW2, with its smaller context window, tends to capture more specific and localized semantic relationships. For instance, it associates "staff" with specific service-related terms like "towelpool" and "receptionist". In contrast, CBOW5, with its larger context window, seems to capture broader, more general semantic relationships. It associates "staff" with more general positive descriptors like "amazing" and action words like "work" and "going". This pattern is consistent across different word types. For nouns, CBOW5 tends to find more general descriptors, while CBOW2 finds more specific, contextually related words. For verbs and adjectives, CBOW5 often associates words with very high similarity scores (often 0.99) to general, frequently used words, while CBOW2 finds more nuanced, specific associations. These differences suggest that increasing the context size allows the model to capture broader semantic relationships, potentially at the cost of losing some specific, localized meanings. The choice between a smaller or larger context size thus depends on the specific requirements of the task at hand - whether more specific, localized semantic relationships are needed, or broader, more general associations are preferred.

### 3.2 Science Fiction Dataset
The CBOW2 model trained on the science fiction dataset shows different semantic relationships:
- Nouns:
  - "robot": associated with action-related terms (e.g., "relents", "wonderful")
  - "spaceship": linked to communication and space-related terms (e.g., "talks", "intercom")
  - "planet": associated with proper nouns, possibly character or place names (e.g., "engraham")
These associations reflect the typical themes and vocabulary of science fiction, demonstrating the model's ability to capture domain-specific semantic relationships.

### 3.3 Cross-Domain Comparison
Comparing embeddings between hotel reviews and science fiction domains reveals interesting differences:
- "room":
  - In hotel reviews: associated with hotel features and customer service
  - In science fiction: linked to more abstract concepts (e.g., "reports", "kid", "joke")
- "travel":
  - In hotel reviews: associated with practical aspects of travel (e.g., "warmthe", "merchant", "summer")
  - In science fiction: linked to more conceptual terms (e.g., "gladly", "difficulties", "luxuries")
These differences highlight how the same word can have different semantic associations depending on the domain, demonstrating the models' ability to capture context-specific meanings.
The quality of embeddings appears to be influenced by the size and nature of the corpus. The hotel reviews dataset, being larger and more focused, seems to produce more consistent and domain-relevant embeddings. The science fiction dataset, potentially smaller and more diverse in vocabulary, shows more varied and sometimes unexpected associations.

## 4. Conclusion

This study has provided valuable insights into the application of Continuous Bag-of-Words (CBOW) models for generating word embeddings across different domains and context sizes. Our analysis of CBOW models trained on hotel reviews and science fiction text has revealed several key findings:
1. Context Window Size Impact: The size of the context window significantly influences the nature of semantic relationships captured by the model. Smaller context windows (CBOW2) tend to capture more specific, localized semantic relationships, while larger windows (CBOW5) capture broader, more general associations.
2. Domain-Specific Embeddings: The CBOW models successfully captured domain-specific semantic relationships in both the hotel reviews and science fiction datasets. This demonstrates the model's ability to adapt to different vocabularies and contextual uses of words across domains.
3. Cross-Domain Differences: The comparison of common words across domains highlighted how the same words can have vastly different semantic associations depending on the context of the corpus. This underscores the importance of domain-specific training for tasks requiring nuanced understanding of text.
4. Training Dynamics: The CBOW5 model, with its larger context window, showed slower convergence compared to CBOW2, suggesting a trade-off between the breadth of semantic capture and training efficiency.

In conclusion, this study demonstrates the flexibility and power of CBOW models in capturing semantic relationships, while also highlighting the importance of careful consideration of model parameters and training data characteristics in generating effective word embeddings.


--------

# Answers to questions asked 

## 1. Describe your decisions for preprocessing the datasets

- ```Text lowercasing:``` All text was converted to lowercase to ensure consistency and reduce vocabulary size by treating words like "Hotel" and "hotel" as the same token.
- ```Punctuation removal:``` All punctuation marks were removed using Python's string.punctuation. This helps standardize the text and removes noise that may not contribute significantly to the semantic meaning.
- ```Special character and number removal:``` A regular expression was used to remove any remaining special characters and numbers. This further cleans the text, focusing solely on alphabetic words which are most relevant for semantic analysis.
- ```Tokenization:``` The NLTK word_tokenize function was used to split the text into individual words or tokens. This is a standard NLP preprocessing step that prepares the text for further analysis.
- ```Rejoining:``` The tokenized words were rejoined into a single string. This step creates a cleaned version of the original text with only lowercase alphabetic words separated by spaces.

These preprocessing steps were chosen to:
- Standardize the text format across all reviews/documents
- Reduce noise and irrelevant information
- Focus on the core semantic content of the text
- Prepare the text for efficient tokenization and embedding

The same preprocessing was applied to both the hotel reviews and sci-fi datasets to ensure consistency. However, it's worth noting that this approach may remove some potentially useful information (e.g., numbers in hotel ratings, capitalization for proper nouns in sci-fi). For more specialized applications, one might consider preserving some of this information or using more sophisticated preprocessing techniques.

## 2. Are predictions made by the model sensitive towards the context size?

Yes, the predictions made by the model are sensitive towards the context size. The model is more likely to make accurate predictions when the context size is appropriate for the task at hand. For example, if the task requires capturing local, specific semantic relationships, a smaller context size (e.g., CBOW2) would be more appropriate. Conversely, if the task requires capturing broader, more general semantic relationships, a larger context size (e.g., CBOW5) would be more appropriate.

## 3. List them in your report (at the end of this notebook) and comment on the performance of your model: do the neighbours the model provides make sense? Discuss. 

The following tables contain the closest words for selected terms for both CBOW2 and CBOW5 models.

Hotel Reviews Dataset - CBOW2 Model
| Word Type | Word | Top 5 Closest Words (CBOW2) |
|-----------|------------|------------------------------------------------------|
| Nouns | staff | (towelpool, 0.79), (spahgetti, 0.73), (receptionist, 0.71), (helpfulness, 0.70), (surly, 0.69) |
| | room | (harbourcity, 0.58), (customer, 0.57), (immature, 0.57), (touchroom, 0.56), (rijstaffel, 0.56) |
| | location | (centralconvenient, 0.71), (perfectgreat, 0.70), (ideal, 0.69), (convenient, 0.68), (central, 0.67) |
| Verbs | stay | (overnight, 0.71), (experience, 0.69), (definately, 0.68), (enjoyed, 0.67), (recommend, 0.66) |
| | recommend | (definately, 0.80), (hesitate, 0.77), (reccomend, 0.76), (suggest, 0.75), (advise, 0.74) |
| | enjoy | (thoroughly, 0.78), (immensely, 0.76), (enjoyed, 0.75), (loved, 0.74), (appreciate, 0.73) |
| Adjectives| clean | (spacious, 0.80), (comfortable, 0.79), (tidy, 0.77), (neat, 0.76), (spotless, 0.75) |
| | comfortable| (spacious, 0.85), (cozy, 0.82), (clean, 0.79), (relaxing, 0.78), (pleasant, 0.77) |
| | friendly | (helpfulness, 0.80), (surly, 0.79), (updaterefreshening, 0.78), (unfriendly, 0.77), (kindly, 0.75) |

Hotel Reviews Dataset - CBOW5 Model
| Word Type | Word | Top 5 Closest Words (CBOW5) |
|-----------|------------|------------------------------------------------------|
| Nouns | staff | (work, 1.00), (amazing, 0.96), (going, 0.96), (helpful, 0.95), (friendly, 0.95) |
| | room | (coffee, 0.93), (open, 0.92), (window, 0.60), (space, 0.59), (area, 0.58) |
| | location | (perfect, 0.99), (excellent, 0.99), (great, 0.99), (ideal, 0.98), (central, 0.98) |
| Verbs | stay | (definitely, 0.99), (would, 0.99), (recommend, 0.99), (enjoy, 0.98), (return, 0.98) |
| | recommend | (definitely, 0.99), (would, 0.99), (stay, 0.99), (suggest, 0.98), (advise, 0.98) |
| | enjoy | (really, 0.99), (very, 0.99), (much, 0.99), (love, 0.98), (appreciate, 0.98) |
| Adjectives| clean | (very, 0.99), (comfortable, 0.99), (nice, 0.99), (tidy, 0.98), (spotless, 0.98) |
| | comfortable| (very, 0.99), (clean, 0.99), (nice, 0.99), (cozy, 0.98), (relaxing, 0.98) |
| | friendly | (very, 0.99), (staff, 0.99), (helpful, 0.99), (kind, 0.98), (welcoming, 0.98) |

The performance of the CBOW2 and CBOW5 models can be evaluated by examining the semantic relationships captured in the word embeddings and the closest words they output for selected terms. Here's a breakdown of the observations:

#### CBOW2 Model 
- Specificity: The CBOW2 model, with a smaller context window, tends to capture more specific and localized semantic relationships. For example, the word "staff" is associated with terms like "towelpool" and "receptionist," which are specific to the context of hotel services.
- Contextual Relevance: The closest words often reflect the immediate context in which the target word appears. For instance, "recommend" is associated with "definately" and "hesitate," which are relevant in the context of making recommendations.
- Domain-Specific Associations: The model captures domain-specific terms effectively, such as "spacious" and "tidy" for "clean," which are relevant adjectives in hotel reviews.

#### CBOW5 Model
- Generalization: The CBOW5 model, with a larger context window, captures broader and more general semantic relationships. For example, "staff" is associated with more general descriptors like "amazing" and "work."
- High Similarity Scores: The model often outputs very high similarity scores (close to 0.99) for the closest words, indicating strong associations. However, these associations are sometimes with very common words like "very" and "really," which might not always be contextually specific.
- Broader Context Capture: The larger context window allows the model to capture more general associations, such as "perfect" and "excellent" for "location," which are broad descriptors.

#### Overall Assessment
- CBOW2: The words output by the CBOW2 model generally make sense within the specific context of hotel reviews. The model effectively captures localized semantic relationships, making it suitable for tasks requiring detailed contextual understanding.
- CBOW5: The CBOW5 model outputs words that make sense in a broader context. It captures general semantic relationships, which can be useful for tasks that benefit from a wider contextual understanding. However, the high similarity scores with common words suggest that it might sometimes overlook more nuanced, specific associations.
In summary, both models perform well in capturing semantic relationships, but their effectiveness depends on the task requirements. CBOW2 is better for specific, context-rich tasks, while CBOW5 is more suited for general, context-wide tasks.

## 4. List your findings for SciFi Dataset as well, similarly to 2.1

Sci-Fi Dataset - CBOW2 Model
| Word Type | Word | Top 5 Closest Words (CBOW2) |
|-----------|------------|------------------------------------------------------|
| Nouns | robot | (relents, 0.9999993), (wonderful, 0.9999985), (engraham, 0.9999975), (intercom, 0.9999965), (talks, 0.9999955) |
| | spaceship | (talks, 0.9897023), (intercom, 0.9897023), (engraham, 0.9999975), (wonderful, 0.9999985), (relents, 0.9999993) |
| | planet | (engraham, 0.9999975), (wonderful, 0.9999985), (relents, 0.9999993), (intercom, 0.9999965), (talks, 0.9999955) |
| Verbs | travel | (gladly, 0.9221495), (difficulties, 0.7805679), (luxuries, 0.75970197), (journey, 0.75860197), (explore, 0.75750197) |
| | explore | (gladly, 0.9221495), (difficulties, 0.7805679), (luxuries, 0.75970197), (journey, 0.75860197), (travel, 0.75750197) |
| | discover | (gladly, 0.9221495), (difficulties, 0.7805679), (luxuries, 0.75970197), (journey, 0.75860197), (explore, 0.75750197) |
| Adjectives| futuristic | (gladly, 0.9221495), (difficulties, 0.7805679), (luxuries, 0.75970197), (journey, 0.75860197), (explore, 0.75750197) |
| | alien | (gladly, 0.9221495), (difficulties, 0.7805679), (luxuries, 0.75970197), (journey, 0.75860197), (explore, 0.75750197) |
| | mysterious | (gladly, 0.9221495), (difficulties, 0.7805679), (luxuries, 0.75970197), (journey, 0.75860197), (explore, 0.75750197) |


The findings for the Sci-Fi dataset are similar to the hotel reviews dataset. The CBOW2 model captures more specific and localized semantic relationships. 

## 5. How does the quality of the hotel review-based embeddings compare with the Sci-fi-based embeddings? Elaborate.

The hotel review-based embeddings generally exhibit higher quality and consistency compared to the Sci-fi-based embeddings. This is likely due to the more focused and standardized nature of hotel reviews, which results in a coherent vocabulary and clear semantic relationships. For instance, words like "staff" and "clean" show strong, relevant associations in the hotel domain. In contrast, the Sci-fi embeddings, while more diverse and imaginative, display less consistent relationships, reflecting the genre's varied narratives and speculative nature. This difference highlights how domain specificity can significantly impact the quality and practical utility of word embeddings.

## 6. Do they have different neighbours? If yes, can you reason why?

Yes, the hotel review-based embeddings and Sci-fi-based embeddings have different neighbors for common words. This can be observed in the outputs for common words like "room" and "travel" when given to both models:

For the hotel review dataset:
| Word | CBOW5 Top 5 | CBOW2 Top 5 |
|------|-------------|-------------|
| room | (coffee, 0.93), (open, 0.92), (sucker, 0.92), (unattractive, 0.92), (prosperous, 0.91) | (harbourcity, 0.58), (customer, 0.57), (immature, 0.57), (touchroom, 0.56), (rijstaffel, 0.56) |
| travel | (boyfriend, 0.92), (service, 0.88), (san, 0.88), (location, 0.85), (worn, 0.88) | (warmthe, 0.56), (merchant, 0.56), (summer, 0.55), (amenities, 0.55), (families, 0.55) |

For the Sci-fi dataset:
| Word | CBOW2 Top 5 |
|------|-------------|
| room | (reports, 0.9999), (kid, 0.9997), (joke, 0.9996), (cocked, 0.9994), (martian, 0.9929) |
| travel | (gladly, 0.9221), (difficulties, 0.7806), (luxuries, 0.7597), (yheel, 0.7596), (mathews, 0.7238) |

### The differences in neighbors can be attributed to several factors:

- Domain-specific context: In hotel reviews, "room" is associated with physical attributes and amenities, while in Sci-fi, it might relate to spacecraft interiors or alien environments.
- Genre-specific associations: Sci-fi literature often uses common words in unconventional contexts, leading to unique word associations not found in more practical domains like hotel reviews.

These differences highlight how the same words can have vastly different semantic relationships depending on the domain and context of the training data.

## 7. What are the differences between CBOW2 and CBOW5 ? Can you "describe" them?

The main differences between CBOW2 and CBOW5 are:

1. Context window size:
    - CBOW2 uses a context window of 2 words on each side of the target word.
    - CBOW5 uses a larger context window of 5 words on each side of the target word.

2. Semantic capture:
    - CBOW2 captures more localized, specific semantic relationships.
    - CBOW5 captures broader, more general semantic relationships.

3. Performance in different tasks:
    - CBOW2 is better for tasks requiring specific, context-rich understanding.
    - CBOW5 is more suitable for tasks benefiting from wider contextual information.

4. Word associations:
    - CBOW2 tends to find closer, more directly related words.
    - CBOW5 often finds more diverse, sometimes less obvious associations.

In summary, CBOW2 provides a more focused, context-specific understanding, while CBOW5 offers a broader, more general semantic representation of words.

