# A3: Word Embeddings and Language Modelling

Created by Adam Ek, modified by Ricardo Muñoz Sánchez and Simon Dobnik

The lab is an exploration and learning exercise to be done in a group and also in discussion with the teachers and other students.

Write all your answers and the code in the appropriate boxes below.

In this lab we will explore constructing *static* word embeddings (i.e. word2vec) and building language models. We'll also evaluate these systems on intermediate tasks, namely word similarity and identifying "good" and "bad" sentences.

* For this we'll use pytorch.
    * You can install it using the instructions from here: https://pytorch.org/
    * If you would like to check out some tutorials on how to use it, you can can do so here: https://pytorch.org/tutorials/beginner/basics/intro.html
    * Some basic operations that will be useful for you can be found here: https://jhui.github.io/2018/02/09/PyTorch-Basic-operations
* We are not interested in getting state-of-the-art performance, focus on the implementation and not results of your model.
    * For this reason, you can use a subset of the dataset: the first 5000-10 000 sentences or so.
    * On linux or mac you can use: ```head -n 10000 inputfile > outputfile```. 
* Using GPUs will make things run faster.
    * You can access the server by using SSH: ```ssh -L 8888:localhost:8888 [your_x_account]@mltgpu.flov.gu.se -p 62266```
        * ```ssh``` tells the computer to connect remotely to the server.
        * ```-L 8888:localhost:8888``` allows you to connect using jupyter notebooks, you can remove it if you don't want to do that.
        * ```-p 62266``` tells the server to give you access through port 62266.
    * You can also connect to the server using VSCode, available for Mac, Linux, and Windows.
    * I would suggest you to set up a virtual environment on the server, such as virtual env or conda.
    * When using pytorch on the server, remember to install the GPU-compatible version!
    * You can also use Google Collab for free (with a monthly quota for GPU usage). We highly suggest you to use the MLT server instead, though.

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

# If you're using GPUs, replace "cpu" with "cuda:n" where n is the index of the GPU
# device = torch.device('cpu')
device = torch.device('cuda:3') # nvidia-smi

# Word2Vec embeddings

In this first part we'll construct a word2vec model which will give us *static* word embeddings (that is, they are fixed after training).

After we've trained our model we will evaluate the embeddings obtained on a word similarity task.

## Formatting data


First we need to load some data from data/wiki-corpus.50000.txt. The file contains 50 000 sentences randomly selected from the complete wikipedia. Each line in the file contains one sentence. The sentences are whitespace tokenized.

Your first task is to create a dataset suitable for word2vec. That is, we define some ```window_size``` then iterate over all sentences in the dataset, putting the target word in one column and the context words in another (separate the columns with ```tab```). ```window_size=n``` means that we select ```n/2``` tokens to the right and left of the center word.

For example, the sentece "this is a lab and exercise" with ```window size = 4``` will be converted to 6 (target, context) pairs:
```
target      context
----------------------------
this        is, a
is          this, a, lab
a           this, is, lab
lab         is, a, and, exercise
and         a, lab, exercise
exercise    lab, and 
```

this will be our training examples for the word2vec model.

[3 marks]

In [2]:

# data_path = './data/test.wiki-corpus.10000.txt'
data_path = './data/wiki-corpus.50000.txt'
WINDOW_SIZE = 4
''' 
def corpus_reader(data_path, window_size=4, min_freq=4):
    all_data = []
    vocabulary = set(['<pad>'])
    with open(data_path) as f:
        # go over the lines (sentences in the files)
        ...
        # split sentences into tokens
        ...
        # save all indiviual words to the vocabulary
        ...
        # extract all (center word, context) with `window_size=4`, pairs from the sentence
        ...
        # save (center word, context) pairs into a dataset
        ...
    
    # filter out words which does not occur often
    ...
    
    # create a mapping from words to integers. 
    # each word should have an unique integer mapped to it. 
    # use a dictionary for this.
    word_to_idx = ...
    return all_data, word_to_idx
'''

def corpus_reader(data_path, window_size=4, min_freq=4):
    all_data = []
    vocabulary = set(['<pad>'])
    word_freq = {}

    # First pass: build vocabulary and count word frequencies
    with open(data_path, 'r', encoding='utf-8') as f:
        sentences = []
        for line in f:
            tokens = line.strip().split()
            sentences.append(tokens)
            for word in tokens:
                vocabulary.add(word)
                word_freq[word] = word_freq.get(word, 0) + 1

    # Filter out infrequent words
    vocabulary = {word for word in vocabulary if word_freq.get(word, 0) >= min_freq or word == '<pad>'}

    # Second pass: extract (center, context) pairs
    for tokens in sentences:
        tokens = [word for word in tokens if word in vocabulary]
        for idx, center_word in enumerate(tokens):
            context = []
            for i in range(max(0, idx - window_size // 2), min(len(tokens), idx + window_size // 2 + 1)):
                if i != idx:
                    context.append(tokens[i])
            if context:
                all_data.append((center_word, context))

    # Create word to index mapping
    word_to_idx = {word: idx for idx, word in enumerate(sorted(vocabulary))}

    return all_data, word_to_idx

We sampled 50 000 senteces completely random from the *whole* wikipedia for our training data. Give some reasons why this is good, and why it might be bad. (*note*: We'll have a few questions like these, one or two reasons for and against is sufficient)

[2 marks]

Random sampling ensures a broad and diverse vocabulary and topics, helping the model generalize better across different types of language and contexts. Random selection also reduces the chance of introducing topic or style bias that might happen if you only select certain categories or types of articles.
But random data can include a lot of uncommon, technical, or noisy sentences that don't contribute much to learning general word relationships. Also if our downstream task is focused on a specific domain (e.g., medical text), random Wikipedia data might not provide relevant context.

### Loading the data

We need to create a dataloader now. That is, some way of generating a batch of examples from the dataset. A batch is a set of ```n``` examples from the data.

The recipe for a dataloader is as follows:

* Select n examples from the dataset
* (a) Translate each example into integers using `word_to_idx`
* (b) Transform the translated examples to pytorch tensors
* (c) Return the batch 
* Select n new examples from the dataset
* ... repeat steps (a-c)

The dataloader should stop when it have read the whole dataset.

This can be done either by first computing all the batches in the dataset and returning it as a list which you can then iterate over, or as an generator that returns each batch after it has been created.

[4 marks]

In [4]:
from collections import namedtuple
Batch = namedtuple('Batch', ['target_word', 'context'])

'''
def batcher(dataset, word_to_idx, batch_size=8):
    # iterate over the dataset
    
    # select a batch of size `batch_size`
    
    # translate batch to integers using `word_to_idx`
    
    # add padding to the context
    
    # transform the batch to a pytorch tensor
    
    # return the dataset of batches/indiviual batches 
    batch = Batch(target_word, context)
'''

def batcher(dataset, word_to_idx, batch_size=8):
    # Helper function to pad context lists
    def pad_contexts(contexts, pad_value):
        max_len = max(len(c) for c in contexts)
        padded = []
        for c in contexts:
            padded.append(c + [pad_value] * (max_len - len(c)))
        return padded

    # Go through the dataset in steps of batch_size
    for i in range(0, len(dataset), batch_size):
        batch_samples = dataset[i:i + batch_size]

        # Convert target words and contexts to indices
        target_word_indices = [word_to_idx[target] for target, _ in batch_samples]
        context_indices = [[word_to_idx[word] for word in context] for _, context in batch_samples]

        # Pad context lists so they're all the same length
        context_indices_padded = pad_contexts(context_indices, word_to_idx['<pad>'])

        # Convert to tensors
        target_tensor = torch.tensor(target_word_indices, dtype=torch.long)
        context_tensor = torch.tensor(context_indices_padded, dtype=torch.long)

        # Create a batch and yield it
        batch = Batch(target_word=target_tensor, context=context_tensor)
        yield batch

We lower-cased all tokens above; give some reasons why this is a good idea, and why it may be harmful to our embeddings.

[2 marks]

Lower-casing can reduces vocabulary size. By treating "Apple" and "apple" as the same word, we avoid data sparsity and make the model easier to train, especially for rare capitalized forms. It can also handles inconsistent capitalization. In real-world text, the same word might appear capitalized or lowercase inconsistently.
Lower-casing may also be harmful. Like it may loss some meaning. Capitalization sometimes conveys important information (e.g., "Apple" the company vs. "apple" the fruit). Lowercasing removes this distinction. Proper nouns and acronyms may lose their identity, which could degrade the quality of embeddings in tasks needing entity recognition.

## Word Embeddings Model

We will implement the CBOW model for constructing word embedding models.

In [5]:
import torch.optim as optim

In the CBOW model we try to predict the center word based on the context. That is, we take as input ```n``` context words, encode them as vectors, then combine them by summation. This will give us one embedding. We then use this embedding to predict *which* word in our vocabuary is the most likely center word. 

Implement this model 

[7 marks]

In [6]:
'''
class CBOWModel(nn.Module):
    def __init__(self, ...):
        super(CBOWModel, self).__init__()
        # where the embeddings of words are stored 
        # each word in the vocabulary should have one embedding assigned to it
        self.embeddings = ...
        # a transformation that predicts a word from the vocabulary
        self.prediction = ...
    
    def forward(self, context):
        # translate a batch to embeddings
        embedded_context = ...
        # reduce dimensions of the embeddings
        projection = ...
        # predict the target word from the vocabulary
        predictions = ...
        
        return predictions
        
    def projection_function(self, xs):
        """
        This function will take as input a tensor of size (B, S, D)
        where B is the batch_size, S the window size, and D the dimensionality of embeddings
        this function should compute the sum over the embedding dimensions of the input, 
        that is, we transform (B, S, D) to (B, 1, D) or (B, D) 
        """
        xs_sum = ...
        return xs_sum
'''

import torch
import torch.nn as nn

class CBOWModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(CBOWModel, self).__init__()
        # Embedding layer: each word gets a vector of size embedding_dim
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        
        # Linear layer to project from embedding space back to vocabulary space
        self.prediction = nn.Linear(embedding_dim, vocab_size)

    def forward(self, context):
        """
        context: Tensor of shape (batch_size, context_size)
        """
        # Look up embeddings for context words
        embedded_context = self.embeddings(context)  # shape: (batch_size, context_size, embedding_dim)
        
        # Reduce dimension: sum over context words
        projection = self.projection_function(context, embedded_context)  # shape: (batch_size, embedding_dim)
        
        # Predict the center word
        predictions = self.prediction(projection)  # shape: (batch_size, vocab_size)
        
        return predictions

    def projection_function(self, context, xs):
        """
        xs: Tensor of shape (batch_size, context_size, embedding_dim)
        Return: Tensor of shape (batch_size, embedding_dim)
        """
        # Sum the embeddings across the context dimension (dim=1)
        xs_sum = xs.sum(dim=1)
        return xs_sum


Now we need to train the models. First we define which hyperparameters to use. (You can change these, for example when *developing* your model you can use a batch size of 2 and a very low dimensionality (say 10), just to speed things up). When actually training your model *fo real*, you can use a batch size of [8,16,32,64], and embedding dimensionality of [128,256].

In [7]:
word_embeddings_hyperparameters = {'epochs':10,
                                   'batch_size':8,
                                   'learning_rate':0.001,
                                   'embedding_dim':128}

Train your model. Iterate over the dataset, get outputs from your model, calculate loss and backpropagate.

We mentioned in the lecture that we use Negative Log Likelihood (https://pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html) loss to train Word2Vec model. In this lab we'll take a shortcut when *training* and use Cross Entropy Loss (https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html), basically it combines ```log_softmax``` and ```NLLLoss```. So what your model should output is a *score* for each word in our vocabulary. The ```CrossEntropyLoss``` will then assign probabilities and calculate the negative log likelihood loss.

[3 marks]

In [8]:
'''
# load data
dataset, vocab = get_data(...)

# build model and construct loss/optimizer
cbow_model = CBOWModel(len(vocab), word_embeddings_hyperparameters['embedding_dim'])
cbow_model.to(device)
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(cbow_model.parameters(), lr=word_embeddings_hyperparameters['lr'])

# start training loop
total_loss = 0
for epoch in range(word_embeddings_hyperparameters['epochs']):
    for i, batch in enumerate(dataset):
        
        context = batch.context
        target_word = batch.target_word
        
        # send your batch of sentences to the model
        output = cbow_model(context)
        
        # compute the loss, you'll need to reshape the input
        # you can read more about this is the documentation for
        # CrossEntropyLoss
        loss = loss_fn(...)
        total_loss += loss.item()
        
        # print average loss for the epoch
        print(total_loss/(i+1), end='\r') 
        
        # compute gradients
        ...
        
        # update parameters
        ...
        
        # reset gradients
        ...
    print()
'''


import torch.optim as optim

import random
def get_data(data_path='wiki-corpus.txt', window_size=4, min_freq=4, batch_size=16, shuffle=True):
    all_data, word_to_idx = corpus_reader(data_path, window_size=window_size, min_freq=min_freq)

    if shuffle:
        random.shuffle(all_data)

    batches = list(batcher(all_data, word_to_idx, batch_size=batch_size))

    return batches, word_to_idx

# load data
dataset, vocab = get_data(
    data_path=data_path,
    window_size=4,
    min_freq=4,
    batch_size=word_embeddings_hyperparameters['batch_size'],
    shuffle=True
)

# build model and construct loss/optimizer
cbow_model = CBOWModel(len(vocab), word_embeddings_hyperparameters['embedding_dim'])
cbow_model.to(device)

loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(cbow_model.parameters(), lr=word_embeddings_hyperparameters['learning_rate'])

# start training loop
for epoch in range(word_embeddings_hyperparameters['epochs']):
    total_loss = 0

    for i, batch in enumerate(dataset):
        context = batch.context.to(device)       # move to GPU/CPU
        target_word = batch.target_word.to(device)

        # forward pass
        output = cbow_model(context)  # shape: (batch_size, vocab_size)

        # compute the loss (CrossEntropy expects (batch_size, vocab_size) and target as (batch_size))
        loss = loss_fn(output, target_word)

        # backward pass
        optimizer.zero_grad()  # reset gradients
        loss.backward()        # compute gradients
        optimizer.step()       # update parameters

        total_loss += loss.item()

    # print average epoch loss
    avg_loss = total_loss / (i + 1)
    print(f"Epoch {epoch+1}, Average Loss: {avg_loss:.4f}")

Epoch 1, Average Loss: 6.5166
Epoch 2, Average Loss: 6.2189
Epoch 3, Average Loss: 6.0259
Epoch 4, Average Loss: 5.8948
Epoch 5, Average Loss: 5.8024
Epoch 6, Average Loss: 5.7349
Epoch 7, Average Loss: 5.6850
Epoch 8, Average Loss: 5.6485
Epoch 9, Average Loss: 5.6219
Epoch 10, Average Loss: 5.5969


In [16]:
# Save the model
torch.save(cbow_model.state_dict(), 'cbow_model.pth')
# Save the vocabulary
with open('vocab.txt', 'w') as f:
    for word, idx in vocab.items():
        f.write(f"{word}\t{idx}\n")

# Load the model
loaded_cbow_model = CBOWModel(len(vocab), word_embeddings_hyperparameters['embedding_dim'])
# loaded_cbow_model.load_state_dict(torch.load('cbow_model.pth'))
state_dict = torch.load('cbow_model.pth', weights_only=True)  # PyTorch 2.2+
loaded_cbow_model.load_state_dict(state_dict)

loaded_cbow_model.to(device)
loaded_cbow_model.eval()  # Set the model to evaluation mode
# Load the vocabulary
loaded_cbow_vocab = {}
with open('vocab.txt', 'r') as f:
    for line in f:
        word, idx = line.strip().split('\t')
        loaded_cbow_vocab[word] = int(idx)
# Check if the loaded model and vocabulary are the same as the original
for key in cbow_model.state_dict():
    if not torch.equal(cbow_model.state_dict()[key], loaded_cbow_model.state_dict()[key]):
        print(f"Mismatch in parameter: {key}")
        assert False, "Model state dicts do not match!"
assert vocab == loaded_cbow_vocab, "Vocabulary does not match!"

## Evaluating the model

We will evaluate the model on a dataset of word similarities, WordSim353 (http://alfonseca.org/eng/research/wordsim353.html , also avalable in the data folder). The first thing we need to do is read the dataset and translate it to integers. What we'll do is to reuse the ```Field``` that records word indexes (the second output of ```get_data()```) and use it to parse the file.

The wordsim data is structured as follows:

```
word1 word2 score
...
```


The ```Field``` we got from ```read_data()``` has two built-in functions, ```stoi``` which maps a string to an integer and ```itos``` which maps an integer to a string. 

What our datareader needs to do is: 

```
for line in file:
    word1, word2, score = file.split()
    # encode word1 and word2 as integers
    word1_idx = vocab.vocab.stoi[word1]
    word2_idx = vocab.vocab.stoi[word2]
```

when we have the integers for ```word_1``` and ```word2``` we'll compute the similarity between their word embeddings with *cosine simlarity*. We can obtain the embeddings by querying the embedding layer of the model.

We calculate the cosine similarity for each word pair in the dataset, then compute the pearson correlation between the similarities we obtained with the scores given in the dataset. 

[4 marks]

In [17]:
''' 
def read_wordsim(path, vocab, embeddings):
    dataset_sims = []
    model_sims = []
    with open(path) as f:
        for line in f:
            word1, word2, score = f.split()
            
            score = float(score)
            dataset_sims.append(score)
            
            # get the index for the word
            word1_idx = ...
            word2_idx = ...
            
            # get the embedding of the word
            word1_emb = ...
            word2_emb = ...
            
            # compute cosine similarity, we'll use the version included in pytorch functional
            # https://pytorch.org/docs/master/generated/torch.nn.functional.cosine_similarity.html
            cosine_similarity = F.cosine_similarity(...)
            
            model_sims.append(cosine_similarity.item())
    
    return dataset_sims, model_sims

path = 'wordsim_similarity_goldstandard.txt'
data, model = read_wordsim(...)
pearson_correlation = np.corrcoef(data, model)
            
# the non-diagonals give the pearson correlation,
print(pearson_correlation)
'''

import torch.nn.functional as F
import numpy as np
from scipy import stats

def read_wordsim(path, vocab, embeddings):
    dataset_sims = []
    model_sims = []

    with open(path) as f:
        for line in f:
            word1, word2, score = line.split()
            score = float(score)
            dataset_sims.append(score)

            # Get indices for word1 and word2, or skip if not in vocab
            if word1 not in vocab or word2 not in vocab:
                continue

            word1_idx = vocab[word1]
            word2_idx = vocab[word2]
            

            # Get embeddings for both words
            # word1_emb = embeddings(word1_idx).unsqueeze(0)  # shape (1, D)
            # word2_emb = embeddings(word2_idx).unsqueeze(0)  # shape (1, D)
            word1_emb = embeddings(torch.tensor([word1_idx], device=device))
            word2_emb = embeddings(torch.tensor([word2_idx], device=device))

            # Compute cosine similarity
            cosine_similarity = F.cosine_similarity(word1_emb, word2_emb).item()
            model_sims.append(cosine_similarity)

    return dataset_sims[:len(model_sims)], model_sims  # align lengths in case of missing words


path_wordsim_similarity = './data/wordsim_similarity_goldstandard.txt'  # or the correct path to your wordsim file

# data_sim, model_sim = read_wordsim(path_wordsim_similarity, vocab, cbow_model.embeddings)
data_sim, model_sim = read_wordsim(path_wordsim_similarity, loaded_cbow_vocab, loaded_cbow_model.embeddings)

# Pearson correlation
pearson_correlation = np.corrcoef(data_sim, model_sim)[0, 1]
print(f"Pearson correlation between human scores and model similarities: {pearson_correlation:.4f}")

rho, pval = stats.spearmanr(data_sim, model_sim)
print(f"Spearman correlation between human scores and model similarities: {rho:.4f}, p-value: {pval:.4f}")

Pearson correlation between human scores and model similarities: 0.2884
Spearman correlation between human scores and model similarities: 0.2880, p-value: 0.0005


Do you think the model performs good or bad? Why?

[3 marks]

The model’s performance depends on the Pearson correlation we obtained. If the correlation is high (e.g., > 0.5), it means the embeddings capture semantic similarity well and the model performs decently. However, Word2Vec with a small dataset and simple CBOW architecture often achieves moderate correlation (0.2 to 0.3). Limitations like small training data, small embedding size, and lack of subword information can reduce performance. So, the model likely performs moderately, not as well as more advanced models like BERT or GloVe trained on massive corpora.

Select the 10 best and 10 worst performing word pairs, can you see any patterns that explain why *these* are the best and worst word pairs?

[3 marks]

In [21]:
def evaluate_word_pairs(wordsim_path, vocab, model_embeddings):
    word_pairs = []
    dataset_sims = []
    model_sims = []

    with open(wordsim_path) as f:
        for line in f:
            word1, word2, score = line.split()
            score = float(score)

            # Only use pairs where both words are in the vocabulary
            if word1 not in vocab or word2 not in vocab:
                continue

            word1_idx = vocab[word1]
            word2_idx = vocab[word2]

            word1_emb = model_embeddings(torch.tensor([word1_idx], device=device))
            word2_emb = model_embeddings(torch.tensor([word2_idx], device=device))

            cosine_similarity = F.cosine_similarity(word1_emb, word2_emb).item()

            word_pairs.append((word1, word2))
            dataset_sims.append(score)
            model_sims.append(cosine_similarity)

    results = list(zip(word_pairs, dataset_sims, model_sims))

    # Sort by smallest difference (best performance)
    best_10 = sorted(results, key=lambda x: x[2], reverse=True)[:10]

    # Sort by largest difference (worst performance)
    worst_10 = sorted(results, key=lambda x: x[2])[:10]

    return best_10, worst_10

# Example usage:
best_10, worst_10 = evaluate_word_pairs(path_wordsim_similarity, vocab, cbow_model.embeddings)

print("Top 10 best performing pairs:")
for (w1, w2), human, model in best_10:
    print(f"{w1}-{w2}: human={human:.2f}, model={model:.2f}")

print("\nTop 10 worst performing pairs:")
for (w1, w2), human, model in worst_10:
    print(f"{w1}-{w2}: human={human:.2f}, model={model:.2f}")


Top 10 best performing pairs:
man-woman: human=8.30, model=0.59
coast-shore: human=9.10, model=0.58
type-kind: human=8.97, model=0.53
skin-eye: human=6.22, model=0.51
direction-combination: human=2.25, model=0.51
student-professor: human=6.81, model=0.49
situation-conclusion: human=4.81, model=0.48
development-issue: human=3.97, model=0.48
football-basketball: human=6.81, model=0.48
planet-sun: human=8.02, model=0.47

Top 10 worst performing pairs:
precedent-cognition: human=2.81, model=-0.12
volunteer-motto: human=2.56, model=-0.06
bread-butter: human=6.19, model=-0.05
precedent-group: human=1.77, model=-0.02
architecture-century: human=3.78, model=-0.02
calculation-computation: human=8.44, model=-0.02
benchmark-index: human=4.25, model=-0.01
money-cash: human=9.15, model=-0.00
money-dollar: human=8.42, model=0.00
media-radio: human=7.42, model=0.02


The best-performing pairs are usually obvious synonyms or related concepts (e.g., man-woman, coast-shore, type-kind), both model and human have high similarity score. While there also some world pairs that the model have a > 0.5 score (e.g., direction-combination) and human have very low score.

The worst-performing pairs often including rare words or named entities that don’t appear frequently in the corpus; polysemous words with multiple meanings or simmilar (e.g., money-cash vs. money-dollar); Antonyms which can be close in context but opposite in meaning.

Suggest some ways of improving the model we apply to WordSim353.

[3 marks]

Increase training data: Use more sentences from Wikipedia or other corpora to improve vocabulary coverage and context diversity.

Increase embedding dimension: Use a larger embedding size (128 → 256 or 512) for better capacity to capture relationships.

Train longer: More epochs can improve learning, especially with larger data.

Use subword information: Like FastText, which can help with rare or unseen words.

Try Skip-gram model: Sometimes performs better than CBOW for small datasets.

Using bigger batch-size to avoid vibration in parameter update.

If we consider a scenario where we use these embeddings in a downstream task, for example sentiment analysis (roughly: determining whether a sentence is positive or negative). 

Give some examples why the sentiment analysis model would benefit from our embeddnings and one examples why our embeddings could hur the performance of the sentiment model.

[3 marks]

Help:

The embeddings capture semantic similarity, so sentiment words (good, great, excellent) will have similar vectors, making it easier for the sentiment model to recognize positive sentiment even with varied wording.

Contextual clues learned from training (e.g., disaster and bad appearing together) improve downstream performance.

Hurt:

If the embeddings don’t capture sentiment polarity well (e.g., good and bad are close because they appear in similar contexts), the sentiment classifier might confuse opposites.

Outdated or biased data from Wikipedia might result in embeddings that don’t reflect modern usage or subtle sentiment cues.

# Language modeling

In this second part we'll build a simple LSTM language model. Your task is to construct a model which takes a sentence as input and predicts the next word for each word in the sentence. For this you'll use the ```LSTM``` class provided by PyTorch (https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html). You can read more about the LSTM here: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

NOTE!!!: Use the same dataset (wiki-corpus.50000.txt) as before.

Our setup is similar to before, we first encode the words as distributed representations then pass these to the LSTM and for each output we predict the next word.

To create a gold standard (what we want to predict), we need to manipulate the tensor containing the sentence. As wi want to predict the *next* word, we want the following setup (where `w_n` is the index of a word in the sentence, `x` is the input words, and `y` is the gold words):

$x = [w_0, w_1, w_2, w_3, w_4]$

$y = [w_1, w_2, w_3, w_4, w_5]$

That is, to create the gold standard we need to shift the index `n` of the input by `+1`, as this gives us the next word.


For this we'll build a new dataloader, the file we pass to the dataloader should contain one sentence per line, with words separated by whitespace.

```
word_1, ..., word_n
word_1, ..., word_k
...
```

in this dataloader you want to make sure that each sentence begins with a ```<start>``` token and ends with a ```<end>``` token. But other than that, just as before you read the dataset and output an iterator over the dataset, a vocabulary, and a mapping from words to indices. 

Implement the dataloader, language model and the training loop (the training loop will basically be the same as for word2vec).

[12 marks]

In [24]:
# you can change these numbers to suit your needs as before
lm_hyperparameters = {'epochs':10,
                      'batch_size':8,
                      'learning_rate':0.001,
                      'embedding_dim':128,
                      'output_dim':128}


In [25]:
'''
data_path = 'wiki-corpus.txt'
def get_data():
    # your code here, roughly the same as for the word2vec dataloader
'''

def get_data(data_path='wiki-corpus.txt', batch_size=16, min_freq=4):
    vocabulary = {'<pad>': 0, '<start>': 1, '<end>': 2}
    word_freq = {}

    sentences = []
    with open(data_path, 'r') as f:
        for line in f:
            tokens = line.strip().lower().split()
            for word in tokens:
                word_freq[word] = word_freq.get(word, 0) + 1

    # Assign indices to words above min_freq
    idx = 3
    for word, count in word_freq.items():
        if count >= min_freq:
            vocabulary[word] = idx
            idx += 1

    # Prepare dataset: for each sentence, create input and shifted output
    data = []
    with open(data_path, 'r') as f:
        for line in f:
            tokens = line.strip().lower().split()
            tokens = ['<start>'] + [w for w in tokens if w in vocabulary] + ['<end>']
            indices = [vocabulary[w] for w in tokens]

            # Skip too short
            if len(indices) < 2:
                continue

            data.append(indices)

    # Create batches
    batches = []
    for i in range(0, len(data), batch_size):
        batch_sentences = data[i:i + batch_size]
        max_len = max(len(s) for s in batch_sentences)

        x_batch = []
        y_batch = []

        for sentence in batch_sentences:
            x = sentence[:-1] + [vocabulary['<pad>']] * (max_len - 1 - len(sentence) + 1)
            y = sentence[1:] + [vocabulary['<pad>']] * (max_len - 1 - len(sentence) + 1)
            x_batch.append(x)
            y_batch.append(y)

        batches.append((torch.tensor(x_batch), torch.tensor(y_batch)))

    return batches, vocabulary


In [26]:
'''
class LM_withLSTM(nn.Module):
    def __init__(...):
        super(LM_withLSTM, self).__init__()
        self.embeddings = ...
        self.LSTM = nn.LSTM(self, input_size=..., hidden_size=...)
        self.predict_word = ...
    
    def forward(self, seq):
        # extract embeddings for the sentence
        embedded_seq = ...
        # compute contextual representations
        timestep_reprentation, *_ = self.LSTM(embedded_seq)
        # predict a token from the vocabulary at each timestep
        predicted_words = ...
        
        return predicted_words
 '''

class LM_withLSTM(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_size):
        super(LM_withLSTM, self).__init__()

        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.LSTM = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_size, batch_first=True)
        self.predict_word = nn.Linear(hidden_size, vocab_size)

    def forward(self, seq):
        embedded_seq = self.embeddings(seq)  # (batch, seq_len, embedding_dim)
        timestep_representation, _ = self.LSTM(embedded_seq)  # (batch, seq_len, hidden_size)
        predicted_words = self.predict_word(timestep_representation)  # (batch, seq_len, vocab_size)
        return predicted_words


In [None]:
'''
# load data
dataset, vocab = get_data(...)

# build model and construct loss/optimizer
lm_model = LM_withLSTM(len(vocab), 
                       lm_hyperparameters['embedding_dim'],
                       lm_hyperparameters['output_dim'])
lm_model.to(device)

loss_fn = CrossEntropyLoss()
optimizer = optim.Adam(cbow_model.parameters(), lr=lm_hyperparameters['lr'])

# start training loop
total_loss = 0
for epoch in range(lm_hyperparameters['epochs']):
    for i, batch in enumerate(dataset):
        
        # the strucure for each BATCH is:
        # <start>, w0, ..., wn, <end>
        sentence = batch.sentence
        
        # when training the model, at each input we predict the *NEXT* token
        # consequently there is nothing to predict when we give the model 
        # <end> as input. 
        # thus, we do not want to give <end> as input to the model, select 
        # from each batch all tokens except the last. 
        # tip: use pytorch indexing/slicing (same as numpy) 
        # (https://pytorch.org/tutorials/beginner/basics/tensorqs_tutorial.html#operations-on-tensors)
        # (https://jhui.github.io/2018/02/09/PyTorch-Basic-operations/)
        input_sentence = ...
        
        # send your batch of sentences to the model
        output = lm_model(input_sentence)
        
        # for each output, the model predict the NEXT token, so we have to reshape 
        # our dataset again. On timestep t, we evaluate on token t+1. That is,
        # we never predict the <start> token ;) so this time, we select all but the first 
        # token from sentences (that is, all the tokens that we predict)
        gold_data = ...
        
        # the shape of the output and sentence variable need to be changed,
        # for the loss function. Details are in the documentation.
        # You can use .view(...,...) to reshape the tensors  
        loss = loss_fn(...)
        total_loss += loss.item()
        
        # print average loss for the epoch
        print(total_loss/(i+1), end='\r') 
        
        # compute gradients
        ...
        
        # update parameters
        ...
        
        # reset gradients
        ...
    print()
'''

# device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load data
dataset, vocab = get_data(data_path=data_path, batch_size=lm_hyperparameters['batch_size'], min_freq=4)

# Build model
lm_model = LM_withLSTM(len(vocab), lm_hyperparameters['embedding_dim'], lm_hyperparameters['output_dim'])
lm_model.to(device)

loss_fn = nn.CrossEntropyLoss(ignore_index=vocab['<pad>'])  # Don't penalize padding predictions
optimizer = optim.Adam(lm_model.parameters(), lr=lm_hyperparameters['learning_rate'])

# Training loop
for epoch in range(lm_hyperparameters['epochs']):
    total_loss = 0

    for i, (x_batch, y_batch) in enumerate(dataset):
        x_batch = x_batch.to(device)
        y_batch = y_batch.to(device)

        # Model input: all but last token
        input_sentence = x_batch

        # Gold output: all but first token
        gold_data = y_batch

        # Forward pass
        output = lm_model(input_sentence)  # (batch, seq_len, vocab_size)

        # Reshape for loss: (batch * seq_len, vocab_size) vs (batch * seq_len)
        output_flat = output.view(-1, output.size(-1))
        gold_flat = gold_data.view(-1)

        loss = loss_fn(output_flat, gold_flat)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    print(f"Epoch {epoch+1}, Average Loss: {total_loss / (i+1):.4f}")


Epoch 1, Average Loss: 5.3982
Epoch 2, Average Loss: 4.8418
Epoch 3, Average Loss: 4.6003
Epoch 4, Average Loss: 4.4285
Epoch 5, Average Loss: 4.2926
Epoch 6, Average Loss: 4.1783
Epoch 7, Average Loss: 4.0788
Epoch 8, Average Loss: 3.9908
Epoch 9, Average Loss: 3.9112
Epoch 10, Average Loss: 3.8408


In [None]:
# Save the model
torch.save(lm_model.state_dict(), 'lm_lstm_model.pth')
# Save the vocabulary
with open('lm_vocab.txt', 'w') as f:
    for word, idx in vocab.items():
        f.write(f"{word}\t{idx}\n")

# Load the model
loaded_lm_model = LM_withLSTM(len(vocab), lm_hyperparameters['embedding_dim'], lm_hyperparameters['output_dim'])
# loaded_lm_model.load_state_dict(torch.load('lm_lstm_model.pth'))
lm_state_dict = torch.load('lm_lstm_model.pth', weights_only=True)  # PyTorch 2.2+
loaded_lm_model.load_state_dict(lm_state_dict)

loaded_lm_model.to(device)
loaded_lm_model.eval()  # Set the model to evaluation mode
# Load the vocabulary
loaded_lm_vocab = {}
with open('lm_vocab.txt', 'r') as f:
    for line in f:
        word, idx = line.strip().split('\t')
        loaded_lm_vocab[word] = int(idx)
# Check if the loaded model and vocabulary are the same as the original
assert lm_model.state_dict() == loaded_lm_model.state_dict(), "Model state dicts do not match!"

for key in loaded_lm_model.state_dict():
    if not torch.equal(loaded_lm_model.state_dict()[key], loaded_lm_model.state_dict()[key]):
        print(f"Mismatch in parameter: {key}")
        assert False, "Model state dicts do not match!"

assert vocab == loaded_lm_vocab, "Vocabulary does not match!"


### Evaluating the language model

We'll evaluate our model using the BLiMP dataset (https://github.com/alexwarstadt/blimp). The BLiMP dataset contains sets of linguistic minimal pairs for various syntactic and semantic phenomena, We'll evaluate our model on *existential quantifiers* (link: https://github.com/alexwarstadt/blimp/blob/master/data/existential_there_quantifiers_1.jsonl). This data, as the name suggests, investigate whether language models assign higher probability to *correct* usage of there-quantifiers. 

An example entry in the dataset is: 

```
{"sentence_good": "There was a documentary about music irritating Allison.", "sentence_bad": "There was each documentary about music irritating Allison.", "field": "semantics", "linguistics_term": "quantifiers", "UID": "existential_there_quantifiers_1", "simple_LM_method": true, "one_prefix_method": false, "two_prefix_method": false, "lexically_identical": false, "pairID": "0"}
```

Download the dataset and build a datareader (similar to what you did for word2vec). The dataset structure you should aim for is (you don't need to worry about the other keys for this assignment):

```
good_sentence_1, bad_sentence_1
...
```

your task now is to compare the probability assigned to the good sentence with to the probability assigned to the bad sentence. To compute a probability for a sentence we consider the product of the probabilities assigned to the *gold* tokens, remember, at timestep ```t``` we're predicting which token comes *next* e.g. ```t+1``` (basically, you do the same thing as you did when training).

In rough pseudo code what your code should do is:

```
accuracy = []
for good_sentence, bad_sentence in dataset:
    gs_lm_output = LanguageModel(good_sentence)
    gs_token_probabilities = softmax(gs_lm_output)
    gs_sentence_probability = product(gs_token_probabilities[GOLD_TOKENS])

    bs_lm_output = LanguageModel(bad_sentence)
    bs_token_probabilities = softmax(bs_lm_output)
    bs_sentence_probability = product(bs_token_probabilities[GOLD_TOKENS])

    # int(True) = 1 and int(False) = 0
    is_correct = int(gs_sentence_probability > bs_sentence_probability)
    accuracy.append(is_correct)

print(numpy.mean(accuracy))
    
```

[6 marks]

In [28]:
''' 
# your code goes here
import json

def evaluate_model(path, vocab, model):
    
    accuracy = []
    with open(path) as f:
        # iterate over one pair of sentences at a time
        for line in f:
            # load the data
            data = json.loads(line)
            good_s = data['sentence_good']
            bad_s = data['sentence_bad']
            
            # the data is tokenized as whitespace
            tok_good_s = ...
            tok_bad_s = ...
            
            # encode your words as integers using the vocab from the dataloader, size is (S)
            # we use unsqueeze to create the batch dimension 
            # in this case our input is only ONE batch, so the size of the tensor becomes: 
            # (S) -> (1, S) as the model expects batches
            enc_good_s = torch.tensor([_ for x in tok_good_s], device=device).unsqueeze(0)
            enc_bad_s = torch.tensor([_ for x in tok_bad_s], device=device).unsqueeze(0)
            
            # pass your encoded sentences to the model and predict the next tokens
            good_s = LM_withLSTM(enc_good_s)
            bad_s = LM_withLSTM(enc_bad_s)
            
            # get probabilities with softmax
            gs_probs = F.softmax(...)
            bs_probs = F.softmax(...)
            
            # select the probability of the gold tokens
            gs_sent_prob = find_token_probs(gs_probs, enc_good_s)
            bs_sent_prob = find_token_probs(bs_probs, enc_bad_s)
            
            accuracy.append(int(gs_sent_prob>bs_sent_prob))
            
    return accuracy
            
def find_token_probs(model_probs, encoded_sentece):
    probs = []

    # iterate over the tokens in your encoded sentence
    for token, gold_token in enumerate(encoded_sentece):
        # select the probability of the gold tokens and save
        # hint: pytorch indexing is helpful here ;)
        prob = ...
        probs.append(prob)
    sentence_prob = ...
    return sentence_prob

path = 'existential_there_quantifiers_1.jsonl'
accuracy = evaluate_model(path, ..., ...)

print('Final accuracy:')
print(np.round(np.mean(accuracy), 3))
'''

import json
import torch.nn.functional as F
import numpy as np

def evaluate_model(path, vocab, model):
    accuracy = []
    model.eval()  # Turn off dropout etc.

    with open(path) as f:
        for line in f:
            data = json.loads(line)
            gs = data['sentence_good']
            bs = data['sentence_bad']

            # Tokenize
            tok_good_s = ['<start>'] + gs.lower().split() + ['<end>']
            tok_bad_s = ['<start>'] + bs.lower().split() + ['<end>']

            # Encode words as indices
            enc_good_s = [vocab[w] if w in vocab else vocab['<pad>'] for w in tok_good_s]
            enc_bad_s = [vocab[w] if w in vocab else vocab['<pad>'] for w in tok_bad_s]

            # Convert to tensors
            enc_good_s = torch.tensor(enc_good_s, device=device).unsqueeze(0)  # (1, S)
            enc_bad_s = torch.tensor(enc_bad_s, device=device).unsqueeze(0)    # (1, S)

            # Get predictions from the model
            good_output = model(enc_good_s)
            bad_output = model(enc_bad_s)

            # Get probabilities with softmax
            gs_probs = F.softmax(good_output, dim=-1)
            bs_probs = F.softmax(bad_output, dim=-1)

            # Compute sentence probabilities
            gs_sent_prob = find_token_probs(gs_probs, enc_good_s)
            bs_sent_prob = find_token_probs(bs_probs, enc_bad_s)

            # If the good sentence has higher probability, count as correct
            accuracy.append(int(gs_sent_prob > bs_sent_prob))

    return accuracy

def find_token_probs(model_probs, encoded_sentence):
    probs = []
    S = encoded_sentence.shape[1]

    for t in range(S - 1):
        # At timestep t, model predicts token at t+1
        gold_token = encoded_sentence[0, t + 1]

        if gold_token.item() == 0:  # Padding token (<pad> has index 0)
            continue

        # Get the model probability assigned to the gold token
        prob = model_probs[0, t, gold_token].item()
        probs.append(prob)

    # Compute the total sentence probability as the product of token probs
    sentence_prob = np.prod(probs) if probs else 0.0
    return sentence_prob


path = './data/existential_there_quantifiers_1.jsonl'
accuracy = evaluate_model(path, vocab, lm_model)

print('Final accuracy:', np.round(np.mean(accuracy), 3))


Final accuracy: 0.703


### Analysis

Our model get some score, say, 55% correct predictions. Is this good? Suggest some *baseline* (i.e. a stupid "model" we hope ours is better than) we can compare the model against.

[3 marks]

A 55% accuracy suggests the model performs slightly better than chance (50%), which is a common baseline for binary classification tasks like this (choosing between good vs. bad sentences). This indicates the model has learned some useful linguistic patterns, but performance is still modest.
Baseline to compare against:

Random guessing → 50% accuracy.

Unigram frequency model → ranks words by frequency without any syntax or context awareness.
Our model should outperform both baselines.

Suggest some improvements you could make to your language model.

[3 marks]

Increase the size of the training corpus to expose the model to more linguistic patterns and reduce overfitting.

Use a larger embedding dimension or hidden size to capture richer word and context representations.

Train for more epochs or with better hyperparameter tuning (e.g., learning rate scheduling).

Switch to bidirectional LSTM (BiLSTM) to allow the model to use both past and future context.

Incorporate pre-trained embeddings (e.g., GloVe, FastText) instead of training embeddings from scratch.

Using bigger batch-size to avoid vibration in parameter update.

Suggest some other metrics we can use to evaluate our system

[2 marks]

# Literature


Neural architectures:

[1] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin. A neural probabilistic language model. (Links to an external site.) Journal of Machine Learning Research, 3(6):1137–1155, 2003. (Sections 3 and 4 are less relevant today and hence you can glance through them quickly. Instead, look at the Mikolov papers where they describe training word embeddings with the current neural network architectures.)

[2] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

[3] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
    


## Statement of contribution

Briefly state how many times you have met for discussions, who was present, to what degree each member contributed to the discussion and the final answers you are submitting.

## Marks

The assignment is marked on a 7-level scale where 4 is sufficient to complete the assignment; 5 is good solid work; 6 is excellent work, covers most of the assignment; and 7: creative work. 

This assignment has a total of 63 marks. These translate to grades as follows: 1 = 17% 2 = 34%, 3 = 50%, 4 = 67%, 5 = 75%, 6 = 84%, 7 = 92% where %s are interpreted as lower bounds to achieve that grade.