# Word2Vec: training from scratch, evaluation, and comparison with the pre-trained LLM from HuggingFace

**Initially, this was a home assignment for one of my NLP courses**

# 1 Word2Vec

### 1.1 Word2Vec Implementation

##### Motivation for choosing Skip-gram:

+ **Makes more sense**: in a Skip-gram model, we predict the context given the center (target) word. Hence, we optimize the target word's embedding to capture all the contexts it appears in. CBOW, on the other hand, optimizes the average of the context words to predict the target word. Given the task definition (map an arbitrary word to a high-quality vector representation), I expect the Skip-gram model to produce higher-quality embeddings.

+ **Simpler input format**: in a Skip-gram model, a training example is a (target, context) pair. For example, with a window size of 2 there are up to 4 pairs per target position. In CBOW, one example looks like ([context_-2, context_-1, context_1, context_2], target), and the context words then have to be combined into a single vector (averaging is the simple, standard way). The Skip-gram format is simpler to implement.
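To make the (target, context) format concrete, here is a minimal sketch of how such pairs can be generated. The function name `skipgram_pairs` and the toy sentence are my own illustration; the actual `SkipGramDataset` may build its pairs differently.

```python
def skipgram_pairs(tokens, window_size=2):
    """Return (target, context) pairs for every position in `tokens`."""
    pairs = []
    for i, target in enumerate(tokens):
        # Clip the window at the sentence boundaries
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


sentence = "the quick brown fox jumps".split()
pairs = skipgram_pairs(sentence, window_size=2)

# The center word "brown" has a full window, so it yields 4 pairs
center_pairs = [p for p in pairs if p[0] == "brown"]
print(center_pairs)
```

Words near the start or end of a document get fewer than 4 pairs, which is why the count above is an upper bound.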

Even though Skip-gram is considered computationally heavier, which results in slower training, I still think it is more suitable for this assignment. I do not think our goal is a perfect-quality model, so we can stop training before the model reaches very good quality.

##### Downloading and cleaning the data

We will take one file from the website with the Wikipedia data. First, we will convert it to a readable format by removing most of the unnecessary information from the original file.

In [None]:
from gensim.corpora import WikiCorpus

def extract_text(input_xml, output_txt):
    wiki = WikiCorpus(input_xml, dictionary={})
    with open(output_txt, 'w', encoding='utf-8') as f:
        for text in wiki.get_texts():
            f.write(' '.join(text) + '\n')  # One article per line

extract_text('enwiki-some-pages-articles.xml.bz2', 'wiki_text.txt')

This results in a pre-processed collection of text (lowercasing, punctuation removal, and some other standard steps are already done). The next possible step is stemming / lemmatization. However, I suggest skipping it in order to keep the original word forms. If the vocabulary turns out to be too big, we can return to this step.

In [None]:
from collections import Counter

MIN_COUNT = 10
text = open('wiki_text.txt').read()
tokens = text.split()
word_counts = Counter(tokens)
vocab = ['<unk>'] + [word for word, count in word_counts.items() if count >= MIN_COUNT]
word_to_idx = {word:i for i, word in enumerate(vocab)}
vocab_size = len(vocab)
vocab_size

40721

**Even with min_count = 10**, the vocabulary turns out to be quite big. Hence, perhaps lemmatization makes sense

In [None]:
import spacy

In [None]:
#! python -m spacy download en_core_web_sm

In [None]:
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

def lemmatize_line(line):
    doc = nlp(line)
    lemmatized_output = ' '.join([token.lemma_ for token in doc])
    return lemmatized_output

lemmatized_lines = []
with open('wiki_text.txt', 'r') as file:
    for line in file:
        lemmatized_line = lemmatize_line(line)
        lemmatized_lines.append(lemmatized_line)

with open('wiki_text_lemmatized.txt', 'w') as output_file:
    for line in lemmatized_lines:
        output_file.write(line + '\n')

In [None]:
from collections import Counter

MIN_COUNT = 10
text = open('wiki_text_lemmatized.txt').read()
tokens = text.split()
word_counts = Counter(tokens)
vocab = ['<unk>'] + [word for word, count in word_counts.items() if count >= MIN_COUNT]
word2idx = {word:i for i, word in enumerate(vocab)}
vocab_size = len(vocab)
vocab_size

35570

**Does not change much** (35570 words instead of 40721). Hence, let's use the non-lemmatized version, as the sentences there are clearer. But let's raise the minimal word count.

In [None]:
from collections import Counter

MIN_COUNT = 20
text = open('wiki_text.txt').read()
tokens = text.split()
word_counts = Counter(tokens)
vocab = ['<unk>'] + [word for word, count in word_counts.items() if count >= MIN_COUNT]
word2idx = {word:i for i, word in enumerate(vocab)}
vocab_size = len(vocab)
vocab_size

24625
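The cells above only count words and build the vocabulary. What happens to words below the threshold is decided in `dataset.py`, which is not shown here. A minimal sketch, assuming rare words are mapped to the `'<unk>'` index (the helper names `build_vocab` and `encode` are hypothetical, and the real dataset class may simply drop rare words instead):

```python
from collections import Counter


def build_vocab(tokens, min_count):
    """Keep words seen at least `min_count` times; index 0 is reserved for '<unk>'."""
    counts = Counter(tokens)
    vocab = ['<unk>'] + [w for w, c in counts.items() if c >= min_count]
    return {w: i for i, w in enumerate(vocab)}


def encode(tokens, word_to_idx):
    """Map tokens to indices, sending out-of-vocabulary words to '<unk>'."""
    unk = word_to_idx['<unk>']
    return [word_to_idx.get(t, unk) for t in tokens]


toy_tokens = "a a a b b c".split()
toy_word_to_idx = build_vocab(toy_tokens, min_count=2)
toy_ids = encode(toy_tokens, toy_word_to_idx)
print(toy_word_to_idx, toy_ids)
```

Raising `min_count` shrinks the embedding tables, at the price of losing every word that falls under the threshold.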

##### Time to build a dataset

+ tokenizer - split words by space (since we do Word2Vec, other tokenization strategies do not really make sense)

+ window_size will be equal to two (I think this is the most standard case - take 2 words before the target word, and 2 words after as the context)

+ we will create a dataset.py file, which will contain the SkipGramDataset class

+ hence, we will just import it and use it to build the dataset

+ regarding negative sampling probabilities, please read comments in the code
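Since `dataset.py` is not reproduced here, the following is only a sketch of one common way to define negative sampling probabilities: the unigram distribution raised to the power 0.75, as in the original Word2Vec work. Whether `SkipGramDataset.negative_sampling_probs` uses exactly this formula is an assumption on my side, and the function below is hypothetical.

```python
from collections import Counter


def negative_sampling_probs(tokens, power=0.75):
    """Smoothed unigram distribution: p(w) is proportional to count(w) ** power."""
    counts = Counter(tokens)
    weights = {w: c ** power for w, c in counts.items()}
    total = sum(weights.values())
    return {w: wt / total for w, wt in weights.items()}


toy_tokens = ["the"] * 16 + ["cat"] * 1
probs = negative_sampling_probs(toy_tokens)

# Raw frequencies would give "cat" 1/17 (about 0.059);
# the 0.75 power lifts it to 1/9 (about 0.111).
print(probs)
```

The exponent flattens the distribution, so rare words are drawn as negatives more often than their raw frequency would suggest.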

**To keep the training time manageable, I decided to reduce the number of documents to be processed (we keep a random 80% of the initial documents)**

In [17]:
import random
random.seed(42)

# leave random keep_fraction of the initial documents (lines) in the file
def create_cut_file(original_file, output_file, keep_fraction=0.8):
    with open(original_file, 'r') as f:
        all_lines = [line for line in f if line.strip()]

    # Calculate number of lines to keep
    num_lines = len(all_lines)
    num_to_keep = int(num_lines * keep_fraction)

    # Randomly sample lines
    sampled_lines = random.sample(all_lines, num_to_keep)
    sampled_lines.sort()
    with open(output_file, 'w') as f:
        f.writelines(sampled_lines)

original_file = 'wiki_text.txt'
output_file = 'cut_text.txt'
create_cut_file(original_file, output_file, keep_fraction=0.8)

In [18]:
from dataset import SkipGramDataset
file_path = 'cut_text.txt'  # Decided to use the non-lemmatized text
min_count = 20
window_size = 2

dataset = SkipGramDataset(file_path, window_size=window_size, min_count=min_count)
dataset.vocab_size

21413

In [19]:
# I was experimenting with different batch sizes
from torch.utils.data import DataLoader
# dataloader = DataLoader(dataset, batch_size=64, shuffle=True)
# dataloader = DataLoader(dataset, batch_size=256, shuffle=True)
# dataloader = DataLoader(dataset, batch_size=1024, shuffle=True)
dataloader = DataLoader(dataset, batch_size=2048, shuffle=True)

##### Train a Skip-gram model

In [20]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from skipgram import SkipGramModel

In [21]:
# Either mps or cuda, depending on which device I'm currently using
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
#device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print('Device:', device)

# Initialize all three models
model_100 = SkipGramModel(vocab_size=len(dataset.word_to_index), embedding_dim=100).to(device)
model_300 = SkipGramModel(vocab_size=len(dataset.word_to_index), embedding_dim=300).to(device)
model_500 = SkipGramModel(vocab_size=len(dataset.word_to_index), embedding_dim=500).to(device)

Device: mps


In [22]:
from tqdm import tqdm
import os

num_negative = 5
learning_rate = 0.005

# Training loop
def train_model(model, model_name, num_epochs=10, save_dir='model_checkpoints'):
  print('Model name:', model)
  optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
  for epoch in range(num_epochs):
      total_loss = 0
      # Wrap the dataloader with tqdm for a progress bar
      with tqdm(dataloader, desc=f'Epoch {epoch+1}/{num_epochs}', unit='batch') as pbar:
          for targets, contexts in pbar:
              targets = targets.to(device)
              contexts = contexts.to(device)
              optimizer.zero_grad()

              # Get embeddings
              target_emb, context_emb = model(targets, contexts)
              pos_score = (target_emb * context_emb).sum(dim=1)  # Dot product for positive pairs

              # Sample negative words
              negative_samples = torch.multinomial(dataset.negative_sampling_probs,
                                                targets.size(0) * num_negative,
                                                replacement=True).view(targets.size(0), num_negative).to(device)
              negative_emb = model.context_embeddings(negative_samples)
              neg_scores = (target_emb.unsqueeze(1) * negative_emb).sum(dim=2)  # Dot products for negative pairs

              # Compute loss
              pos_loss = F.binary_cross_entropy_with_logits(pos_score, torch.ones_like(pos_score), reduction='sum')
              neg_loss = F.binary_cross_entropy_with_logits(neg_scores, torch.zeros_like(neg_scores), reduction='sum')
              loss = (pos_loss + neg_loss) / targets.size(0)

              # Backpropagation
              loss.backward()
              optimizer.step()

              total_loss += loss.item()

              # Update the progress bar with the current loss
              pbar.set_postfix({'loss': loss.item()})

      print(f'Epoch {epoch+1}, Average Loss: {total_loss / len(dataloader)}')

  os.makedirs(save_dir, exist_ok=True)  # make sure the checkpoint directory exists
  save_path = os.path.join(save_dir, model_name)
  torch.save(model.state_dict(), save_path)
  print(f'Model weights saved to {save_path}')
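
The two `binary_cross_entropy_with_logits` calls in the training loop amount to the usual negative-sampling objective: with label 1 the term is `-log(sigmoid(score))`, and with label 0 it is `-log(sigmoid(-score))`. Below is a small sketch of that per-example loss in plain Python (no PyTorch). The function names and the toy scores are illustrative only.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def sgns_loss(pos_score, neg_scores):
    """Negative-sampling loss for one (target, context) pair."""
    pos_loss = -log_sigmoid(pos_score)                     # BCE with label 1
    neg_loss = -sum(log_sigmoid(-s) for s in neg_scores)   # BCE with label 0
    return pos_loss + neg_loss


# All scores at zero: each of the 6 terms contributes log(2)
untrained = sgns_loss(0.0, [0.0] * 5)
# High positive score and low negative scores give a much smaller loss
trained = sgns_loss(4.0, [-4.0] * 5)
print(untrained, trained)
```

With `num_negative = 5`, a model whose scores are all zero would start at `6 * log(2)`, roughly 4.16 per example, which is a handy reference point when reading the loss values in the logs.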

**To make the experiment fair, we will train each model for 10 epochs**

In [23]:
from collections import Counter
min_count = 20
with open(file_path, 'r') as f:
      documents = [line.strip().split() for line in f]
      word_counts = Counter(word for doc in documents for word in doc)
      vocab = [word for word, count in word_counts.items() if count >= min_count]

In [24]:
len(vocab)

21413

In [25]:
import pandas as pd
import numpy as np

In [26]:
# https://www.kaggle.com/datasets/julianschelb/wordsim353-crowd
wordsim = pd.read_csv('wordsim353crowd.csv')
wordsim

| | Word 1 | Word 2 | Human (Mean) |
|---|---|---|---|
| 0 | admission | ticket | 5.5360 |
| 1 | alcohol | chemistry | 4.1250 |
| 2 | aluminum | metal | 6.6250 |
| 3 | announcement | effort | 2.0625 |
| 4 | announcement | news | 7.1875 |
| ... | ... | ... | ... |
| 348 | weapon | secret | 2.5000 |
| 349 | weather | forecast | 5.4375 |
| 350 | Wednesday | news | 1.1250 |
| 351 | wood | forest | 7.9375 |


In [27]:
wordsim['Word 1'] = wordsim['Word 1'].str.lower()
wordsim['Word 2'] = wordsim['Word 2'].str.lower()

# Keep only the rows where both words are in the vocabulary
wordsim = wordsim[wordsim['Word 1'].isin(vocab) & wordsim['Word 2'].isin(vocab)]
wordsim.index = np.arange(0, len(wordsim))
wordsim

| | Word 1 | Word 2 | Human (Mean) |
|---|---|---|---|
| 0 | admission | ticket | 5.5360 |
| 1 | alcohol | chemistry | 4.1250 |
| 2 | aluminum | metal | 6.6250 |
| 3 | announcement | effort | 2.0625 |
| 4 | announcement | news | 7.1875 |
| ... | ... | ... | ... |
| 299 | weapon | secret | 2.5000 |
| 300 | weather | forecast | 5.4375 |
| 301 | wednesday | news | 1.1250 |
| 302 | wood | forest | 7.9375 |


In [28]:
def cosine_similarity(vec1, vec2):
    return torch.dot(vec1, vec2) / (torch.norm(vec1) * torch.norm(vec2) + 1e-8) # 1e-8 to avoid division by zero

In [29]:
from scipy.stats import spearmanr

def compute_correlation(model, df):
  cosine_similarities = []
  human_scores = []
  for index, row in df.iterrows():
      word1 = row['Word 1']
      word2 = row['Word 2']
      human_score = row['Human (Mean)']

      # Get embeddings for Word 1 and Word 2
      word_1_index = dataset.convert_word_to_idx(word1)
      word_1_tensor = torch.tensor([word_1_index], device=device)  # Create tensor directly on the device
      embedding1 = model.target_embeddings(word_1_tensor).squeeze(0)

      word_2_index = dataset.convert_word_to_idx(word2)
      word_2_tensor = torch.tensor([word_2_index], device=device)  # Create tensor directly on the device
      embedding2 = model.target_embeddings(word_2_tensor).squeeze(0)


      # Compute cosine similarity
      similarity = cosine_similarity(embedding1, embedding2).item()  # Convert to Python float
      cosine_similarities.append(similarity)
      human_scores.append(human_score)

  # Compute Spearman’s correlation coefficient
  correlation, _ = spearmanr(cosine_similarities, human_scores)
  print(f"Spearman’s correlation coefficient: {correlation:.4f}")
  return correlation
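
Spearman's coefficient compares rankings rather than raw values, so it does not matter that cosine similarities and human scores live on different scales. The sketch below shows the idea in plain Python on made-up numbers. It assumes there are no ties, whereas `scipy.stats.spearmanr` handles ties with average ranks, so treat it as an illustration rather than a replacement.

```python
def ranks(values):
    """Rank values from 1..n (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman_no_ties(xs, ys):
    """Spearman's rho via the rank-difference formula (valid without ties)."""
    n = len(xs)
    rx, ry = ranks(xs), ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))


toy_cosine = [0.10, 0.35, 0.20, 0.80]  # hypothetical model similarities
toy_human = [1.5, 6.0, 7.0, 9.0]       # hypothetical human scores
rho = spearman_no_ties(toy_cosine, toy_human)
print(rho)  # one swapped pair out of four items gives 0.8
```

Any monotone relationship, even a strongly non-linear one, gives a coefficient of 1.0, which is why this metric suits similarity benchmarks.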

In [16]:
train_model(model_100, 'emb_dim_100.pt')

Model name: SkipGramModel(
  (target_embeddings): Embedding(21413, 100)
  (context_embeddings): Embedding(21413, 100)
)


Epoch 1/10: 100%|██████████| 11180/11180 [02:12<00:00, 84.54batch/s, loss=2.31] 


Epoch 1, Average Loss: 3.757320485110786


Epoch 2/10: 100%|██████████| 11180/11180 [02:09<00:00, 86.02batch/s, loss=2.18]


Epoch 2, Average Loss: 2.1817281813229132


Epoch 3/10: 100%|██████████| 11180/11180 [02:03<00:00, 90.37batch/s, loss=2.13] 


Epoch 3, Average Loss: 2.1177927609206524


Epoch 4/10: 100%|██████████| 11180/11180 [02:06<00:00, 88.73batch/s, loss=2.09] 


Epoch 4, Average Loss: 2.0979157258444907


Epoch 5/10: 100%|██████████| 11180/11180 [02:02<00:00, 91.54batch/s, loss=2.16] 


Epoch 5, Average Loss: 2.089140125741259


Epoch 6/10: 100%|██████████| 11180/11180 [02:06<00:00, 88.21batch/s, loss=2.06]


Epoch 6, Average Loss: 2.0839818964170855


Epoch 7/10: 100%|██████████| 11180/11180 [02:07<00:00, 87.88batch/s, loss=2.13] 


Epoch 7, Average Loss: 2.080983997105699


Epoch 8/10: 100%|██████████| 11180/11180 [02:05<00:00, 89.29batch/s, loss=2.11] 


Epoch 8, Average Loss: 2.078605863659881


Epoch 9/10: 100%|██████████| 11180/11180 [02:06<00:00, 88.68batch/s, loss=2.11] 


Epoch 9, Average Loss: 2.077235227887234


Epoch 10/10: 100%|██████████| 11180/11180 [02:05<00:00, 89.25batch/s, loss=2.18]

Epoch 10, Average Loss: 2.076324238184313
Model weights saved to model_checkpoints/emb_dim_100.pt





In [15]:
checkpoint = torch.load('model_checkpoints/emb_dim_100.pt', map_location=device)
model_100 = SkipGramModel(vocab_size=len(dataset.word_to_index), embedding_dim=100).to(device)
model_100.load_state_dict(checkpoint)
compute_correlation(model_100, wordsim)

Spearman’s correlation coefficient: 0.4319


0.4318654646450173

In [18]:
train_model(model_300, 'emb_dim_300.pt')

Model name: SkipGramModel(
  (target_embeddings): Embedding(21413, 300)
  (context_embeddings): Embedding(21413, 300)
)


Epoch 1/10: 100%|██████████| 11180/11180 [02:43<00:00, 68.51batch/s, loss=2.85]


Epoch 1, Average Loss: 6.145718262468553


Epoch 2/10: 100%|██████████| 11180/11180 [02:40<00:00, 69.57batch/s, loss=2.39]


Epoch 2, Average Loss: 2.4446142360434764


Epoch 3/10: 100%|██████████| 11180/11180 [02:39<00:00, 70.27batch/s, loss=2.18]


Epoch 3, Average Loss: 2.249341986153761


Epoch 4/10: 100%|██████████| 11180/11180 [02:41<00:00, 69.07batch/s, loss=2.16]


Epoch 4, Average Loss: 2.195541905717048


Epoch 5/10: 100%|██████████| 11180/11180 [02:37<00:00, 71.14batch/s, loss=2.16]


Epoch 5, Average Loss: 2.1691580428420325


Epoch 6/10: 100%|██████████| 11180/11180 [02:40<00:00, 69.79batch/s, loss=2.22]


Epoch 6, Average Loss: 2.1536129850181145


Epoch 7/10: 100%|██████████| 11180/11180 [02:39<00:00, 70.30batch/s, loss=2.19]


Epoch 7, Average Loss: 2.1434251589297397


Epoch 8/10: 100%|██████████| 11180/11180 [02:40<00:00, 69.48batch/s, loss=2.22]


Epoch 8, Average Loss: 2.1363779625653794


Epoch 9/10: 100%|██████████| 11180/11180 [02:39<00:00, 70.22batch/s, loss=2.21]


Epoch 9, Average Loss: 2.131725310330314


Epoch 10/10: 100%|██████████| 11180/11180 [02:38<00:00, 70.71batch/s, loss=2.14]

Epoch 10, Average Loss: 2.1277746795648325
Model weights saved to model_checkpoints/emb_dim_300.pt





In [16]:
checkpoint = torch.load('model_checkpoints/emb_dim_300.pt', map_location=device)
model_300 = SkipGramModel(vocab_size=len(dataset.word_to_index), embedding_dim=300).to(device)
model_300.load_state_dict(checkpoint)
compute_correlation(model_300, wordsim)

Spearman’s correlation coefficient: 0.4627


0.4626894580061271

In [20]:
train_model(model_500, 'emb_dim_500.pt')

Model name: SkipGramModel(
  (target_embeddings): Embedding(21413, 500)
  (context_embeddings): Embedding(21413, 500)
)


Epoch 1/10: 100%|██████████| 11180/11180 [03:16<00:00, 56.94batch/s, loss=3.35]


Epoch 1, Average Loss: 8.265202607025188


Epoch 2/10: 100%|██████████| 11180/11180 [03:16<00:00, 56.98batch/s, loss=2.82]


Epoch 2, Average Loss: 2.8015666036784967


Epoch 3/10: 100%|██████████| 11180/11180 [03:16<00:00, 56.89batch/s, loss=2.54]


Epoch 3, Average Loss: 2.454843101646478


Epoch 4/10: 100%|██████████| 11180/11180 [03:15<00:00, 57.30batch/s, loss=2.54]


Epoch 4, Average Loss: 2.370408450640165


Epoch 5/10: 100%|██████████| 11180/11180 [03:14<00:00, 57.35batch/s, loss=2.31]


Epoch 5, Average Loss: 2.3267650905576716


Epoch 6/10: 100%|██████████| 11180/11180 [03:18<00:00, 56.44batch/s, loss=2.38]


Epoch 6, Average Loss: 2.301671056350783


Epoch 7/10: 100%|██████████| 11180/11180 [03:17<00:00, 56.66batch/s, loss=2.44]


Epoch 7, Average Loss: 2.286181752626286


Epoch 8/10: 100%|██████████| 11180/11180 [03:14<00:00, 57.34batch/s, loss=2.26]


Epoch 8, Average Loss: 2.27491041338721


Epoch 9/10: 100%|██████████| 11180/11180 [03:17<00:00, 56.49batch/s, loss=2.21]


Epoch 9, Average Loss: 2.2668731922753595


Epoch 10/10: 100%|██████████| 11180/11180 [03:15<00:00, 57.21batch/s, loss=2.39]

Epoch 10, Average Loss: 2.261783000203066
Model weights saved to model_checkpoints/emb_dim_500.pt





In [18]:
checkpoint = torch.load('model_checkpoints/emb_dim_500.pt', map_location=device)
model_500 = SkipGramModel(vocab_size=len(dataset.word_to_index), embedding_dim=500).to(device)
model_500.load_state_dict(checkpoint)
compute_correlation(model_500, wordsim)

Spearman’s correlation coefficient: 0.4144


0.41442795960423207

##### Comparison

As we see, the model with embedding_dim=300 is the best in terms of quality (correlation with the human-annotated dataset: 0.4627), and the model with embedding_dim=500 is the worst (0.4144, versus 0.4319 for embedding_dim=100). This result makes sense and can be explained as follows:

+ Not enough data for bigger models (remember the curse of dimensionality). That is likely why embedding_dim=500 performs worst. We cannot add more data due to time and resource constraints

+ Not enough training: 10 epochs may be too few for the bigger models, and the same time and resource constraints apply

+ Architectures with bigger embedding sizes may need to be more complex (for example, several layers, activation functions, some normalization, etc.). However, we do not want to compare different models with each other. We want to compare the same model with different numbers of parameters

+ Meanwhile, embedding_dim=100 might not be enough to capture word meanings. Hence, embedding_dim=300 might be better. For now the difference in quality is not that big (0.4319 vs 0.4627; however, this is only one metric, and probably not a very comprehensive one). Maybe with more training epochs the difference in quality will become more significant. However, in the next sections we will have to train more models, so I suggest choosing embedding_dim=100 as the baseline (not much worse than embedding_dim=300, but faster to train)

Hence, for our next experiments (for example, for improvements), we will use the model with embedding_dim=100

### 1.2 Word2Vec improvement

Out of all the proposed methods, we will **focus on leveraging external word knowledge sources. Motivation:**

+ **Word sense disambiguation** (check https://www.geeksforgeeks.org/word-sense-disambiguation-in-natural-language-processing/ for reference), as far as I understand, is about handling words with multiple meanings by assigning a different embedding to each sense. However, this would require a specific high-quality dataset (not random Wikipedia articles). Moreover, this method handles specific cases, while we want to improve the general quality of the embeddings (given time and resource constraints)

+ **Evaluating character-level embeddings** is about a different tokenization (splitting text not by spaces, but by characters). I do not think this makes a lot of sense in our task, since Word2Vec is Word2Vec, and it aims to convert a word to an embedding. In other tasks (like text generation) different tokenizations (like BPE) make more sense, but they also involve different neural network architectures.

+ **Hence, I think that leveraging external word knowledge sources makes the most sense in our task**. We will use WordNet to extract synonym relationships and add a synonym term to the training loss. Unlike post-hoc retrofitting, this means training the model again, but it requires no changes to the architecture
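
For reference, retrofitting in its classic post-hoc form (Faruqui et al., 2015) leaves the trained model untouched and only moves each vector towards the vectors of its lexicon neighbours. Here is a minimal plain-Python sketch of that update, using the common weighting in which a word's original vector and the mean of its neighbours count equally. The function, the toy vectors, and the toy synonym list are all illustrative. Note that the training function later in this section takes a different route and adds a synonym term to the loss instead.

```python
def retrofit(vectors, neighbors, num_iters=10):
    """Pull each vector toward its lexicon neighbours while staying
    close to its original value (equal weight for both parts)."""
    new_vectors = {w: list(v) for w, v in vectors.items()}
    for _ in range(num_iters):
        for word, nbrs in neighbors.items():
            nbrs = [n for n in nbrs if n in new_vectors]
            if word not in new_vectors or not nbrs:
                continue
            dim = len(vectors[word])
            mean_nbr = [sum(new_vectors[n][d] for n in nbrs) / len(nbrs)
                        for d in range(dim)]
            new_vectors[word] = [(vectors[word][d] + mean_nbr[d]) / 2
                                 for d in range(dim)]
    return new_vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


toy_vectors = {'lemon': [1.0, 0.0], 'citron': [0.0, 1.0], 'bottle': [-1.0, -1.0]}
toy_synonyms = {'lemon': ['citron'], 'citron': ['lemon']}
retro = retrofit(toy_vectors, toy_synonyms)

print(sq_dist(toy_vectors['lemon'], toy_vectors['citron']),
      sq_dist(retro['lemon'], retro['citron']))
```

Words without lexicon neighbours (like `'bottle'` here) keep their original vectors, so retrofitting can only help words that WordNet actually covers.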




In [30]:
import nltk

In [22]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [15]:
list(dataset.word_to_index.keys())[:5]

['aarhus', 'was', 'handball', 'club', 'from']

In [16]:
list(dataset.word_to_index.values())[:5]

[0, 1, 2, 3, 4]

In [17]:
from nltk.corpus import wordnet as wn
# Extract synonym pairs from WordNet
def get_synonym_pairs(word_to_index):
    synonym_pairs = []
    # Use the mapping passed in (dict lookups) instead of the global vocab list
    for word in word_to_index:
        synsets = wn.synsets(word)
        for synset in synsets:
            for lemma in synset.lemmas():
                syn = lemma.name()
                if syn in word_to_index and syn != word:
                    # Add both directions since synonymy is symmetric
                    synonym_pairs.append((word_to_index[word], word_to_index[syn]))
                    synonym_pairs.append((word_to_index[syn], word_to_index[word]))
    # Remove duplicates while preserving order
    synonym_pairs = list(dict.fromkeys(synonym_pairs))
    return synonym_pairs

In [18]:
synonym_pairs = get_synonym_pairs(dataset.word_to_index)
synonym_pairs[:5]

[(1, 232), (232, 1), (1, 12559), (12559, 1), (1, 2349)]

In [34]:
lambdda = 0.1  # weight for the synonym loss (kept small at 0.1, as I am not sure the method will work well)
m = 128  # number of synonym pairs to sample per batch (neither too big nor too small, so it should be okay)

def train_model_knowledge(model, model_name, num_epochs=10, save_dir='model_checkpoints', lambdda=lambdda, m=m):
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    print(f"Found {len(synonym_pairs)} synonym pairs from WordNet.")

    for epoch in range(num_epochs):
        total_loss = 0
        with tqdm(dataloader, desc=f'Epoch {epoch+1}/{num_epochs}', unit='batch') as pbar:
            for targets, contexts in pbar:
                targets = targets.to(device)
                contexts = contexts.to(device)
                optimizer.zero_grad()

                target_emb, context_emb = model(targets, contexts)
                pos_score = (target_emb * context_emb).sum(dim=1)  # Dot product for positive pairs

                negative_samples = torch.multinomial(dataset.negative_sampling_probs,
                                                    targets.size(0) * num_negative,
                                                    replacement=True).view(targets.size(0), num_negative).to(device)
                negative_emb = model.context_embeddings(negative_samples)
                neg_scores = (target_emb.unsqueeze(1) * negative_emb).sum(dim=2)

                # Original loss
                pos_loss = F.binary_cross_entropy_with_logits(pos_score, torch.ones_like(pos_score), reduction='sum')
                neg_loss = F.binary_cross_entropy_with_logits(neg_scores, torch.zeros_like(neg_scores), reduction='sum')

                # Step 2: Sample synonym pairs and compute synonym loss
                # Sample m synonym pairs with replacement
                sampled_pairs = random.choices(synonym_pairs, k=m)
                w1_indices = torch.tensor([pair[0] for pair in sampled_pairs], device=device)
                w2_indices = torch.tensor([pair[1] for pair in sampled_pairs], device=device)

                # Get embeddings for synonym pairs
                target_emb_w1 = model.target_embeddings(w1_indices)
                context_emb_w2 = model.context_embeddings(w2_indices)
                syn_score = (target_emb_w1 * context_emb_w2).sum(dim=1)

                # Synonym loss to encourage similarity
                syn_loss = F.binary_cross_entropy_with_logits(syn_score, torch.ones_like(syn_score), reduction='sum')

                # Step 3: Combine losses
                loss = (pos_loss + neg_loss + lambdda * syn_loss) / targets.size(0)

                # Backpropagation
                loss.backward()
                optimizer.step()

                total_loss += loss.item()
                pbar.set_postfix({'loss': loss.item()})

        print(f'Epoch {epoch+1}, Average Loss: {total_loss / len(dataloader)}')

    os.makedirs(save_dir, exist_ok=True)  # make sure the checkpoint directory exists
    save_path = os.path.join(save_dir, model_name)
    torch.save(model.state_dict(), save_path)
    print(f'Model weights saved to {save_path}')
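
Written out, the per-batch objective that `train_model_knowledge` minimizes can be summarized as below. The notation is mine, not part of the original code: $B$ is the batch size, $K$ is `num_negative`, $u$ and $v$ are rows of `target_embeddings` and `context_embeddings`, $(t_i, c_i)$ are the positive pairs, $n_{ik}$ are the sampled negatives, and $(a_j, b_j)$ are the `m` sampled synonym pairs.

```latex
\mathcal{L} = \frac{1}{B}\left[
  \sum_{i=1}^{B}\left(
    -\log\sigma\!\left(u_{t_i}^{\top} v_{c_i}\right)
    - \sum_{k=1}^{K}\log\sigma\!\left(-u_{t_i}^{\top} v_{n_{ik}}\right)
  \right)
  - \lambda \sum_{j=1}^{m}\log\sigma\!\left(u_{a_j}^{\top} v_{b_j}\right)
\right]
```

Because the synonym sum has $m$ terms but is divided by $B$, its effective weight relative to one training pair is $\lambda m / B$. For a full batch of 2048, the defaults (`lambdda=0.1`, `m=128`) give about 0.006, while `lambdda=2`, `m=512` gives 0.5.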

##### Let's check if it works

In [28]:
model_100_knowledge = SkipGramModel(vocab_size=len(dataset.word_to_index), embedding_dim=100).to(device)

In [29]:
train_model_knowledge(model_100_knowledge, '100_knowledge.pt')

Found 65240 synonym pairs from WordNet.


Epoch 1/10: 100%|██████████| 11180/11180 [02:07<00:00, 87.83batch/s, loss=2.31] 


Epoch 1, Average Loss: 3.7862475979946595


Epoch 2/10: 100%|██████████| 11180/11180 [02:04<00:00, 89.52batch/s, loss=2.25] 


Epoch 2, Average Loss: 2.188947153709869


Epoch 3/10: 100%|██████████| 11180/11180 [02:07<00:00, 87.80batch/s, loss=2.13] 


Epoch 3, Average Loss: 2.125254548918156


Epoch 4/10: 100%|██████████| 11180/11180 [02:04<00:00, 89.89batch/s, loss=2.17]


Epoch 4, Average Loss: 2.10561693280882


Epoch 5/10: 100%|██████████| 11180/11180 [02:06<00:00, 88.59batch/s, loss=2.07]


Epoch 5, Average Loss: 2.097024903378461


Epoch 6/10: 100%|██████████| 11180/11180 [02:04<00:00, 89.57batch/s, loss=2.09]


Epoch 6, Average Loss: 2.0919627403317285


Epoch 7/10: 100%|██████████| 11180/11180 [02:06<00:00, 88.71batch/s, loss=2.15]


Epoch 7, Average Loss: 2.0891373323626508


Epoch 8/10: 100%|██████████| 11180/11180 [02:05<00:00, 89.06batch/s, loss=2.1] 


Epoch 8, Average Loss: 2.087162859441982


Epoch 9/10: 100%|██████████| 11180/11180 [02:06<00:00, 88.52batch/s, loss=2.24] 


Epoch 9, Average Loss: 2.08559569080955


Epoch 10/10: 100%|██████████| 11180/11180 [02:03<00:00, 90.52batch/s, loss=2.16]

Epoch 10, Average Loss: 2.0844823240487433
Model weights saved to model_checkpoints/100_knowledge.pt





**Let's try to make the new loss much more important and see what it results in**

In [60]:
model_100_knowledge_importance = SkipGramModel(vocab_size=len(dataset.word_to_index), embedding_dim=100).to(device)

In [61]:
train_model_knowledge(model_100_knowledge_importance, '100_knowledge_importance.pt', lambdda=2, m=512)

Found 65240 synonym pairs from WordNet.


Epoch 1/10: 100%|██████████| 11180/11180 [02:09<00:00, 86.53batch/s, loss=2.37]


Epoch 1, Average Loss: 3.950470957026712


Epoch 2/10: 100%|██████████| 11180/11180 [02:09<00:00, 86.37batch/s, loss=2.15] 


Epoch 2, Average Loss: 2.235824123456781


Epoch 3/10: 100%|██████████| 11180/11180 [02:08<00:00, 86.96batch/s, loss=2.23]


Epoch 3, Average Loss: 2.1639981471788476


Epoch 4/10: 100%|██████████| 11180/11180 [02:09<00:00, 86.45batch/s, loss=2.17]


Epoch 4, Average Loss: 2.1425531519333663


Epoch 5/10: 100%|██████████| 11180/11180 [02:10<00:00, 85.69batch/s, loss=2.13]


Epoch 5, Average Loss: 2.1331264216272903


Epoch 6/10: 100%|██████████| 11180/11180 [02:07<00:00, 87.79batch/s, loss=2.12]


Epoch 6, Average Loss: 2.127753461814737


Epoch 7/10: 100%|██████████| 11180/11180 [02:11<00:00, 85.08batch/s, loss=2.17] 


Epoch 7, Average Loss: 2.124710267281063


Epoch 8/10: 100%|██████████| 11180/11180 [02:07<00:00, 87.35batch/s, loss=2.2]


Epoch 8, Average Loss: 2.1228138141320727


Epoch 9/10: 100%|██████████| 11180/11180 [02:11<00:00, 85.27batch/s, loss=2.18]


Epoch 9, Average Loss: 2.121129121059595


Epoch 10/10: 100%|██████████| 11180/11180 [02:08<00:00, 86.93batch/s, loss=2.17]

Epoch 10, Average Loss: 2.1202801537321805
Model weights saved to model_checkpoints/100_knowledge_importance.pt





In [None]:
# metric we need to beat: 0.4319
checkpoint = torch.load('model_checkpoints/100_knowledge.pt', map_location=device)
model_100_knowledge = SkipGramModel(vocab_size=len(dataset.word_to_index), embedding_dim=100).to(device)
model_100_knowledge.load_state_dict(checkpoint)
compute_correlation(model_100_knowledge, wordsim)

Spearman’s correlation coefficient: 0.4615


0.4615047166264469

In [20]:
# metric we need to beat: 0.4319
checkpoint = torch.load('model_checkpoints/100_knowledge_importance.pt', map_location=device)
model_100_knowledge_importance = SkipGramModel(vocab_size=len(dataset.word_to_index), embedding_dim=100).to(device)
model_100_knowledge_importance.load_state_dict(checkpoint)
compute_correlation(model_100_knowledge_importance, wordsim)

Spearman’s correlation coefficient: 0.4935


0.49346005577594065

**Okay, in this case the metric improved: 0.4615 with the low synonym weight and 0.4935 with the high one, compared to 0.4319 for the baseline. Hence, the proposed improvement probably makes sense, and the model with more influence from the external knowledge performs better than the one with less. However, a single metric is not conclusive, and I do not think we can compare model performance based on it alone. Fortunately, there are other ways to check. For example, it is worth visualizing the embeddings (only a small part, since there are too many words in the vocabulary) and looking at the most similar words to a particular word. Let's do it**

In [19]:
from quality_assessment import most_similar

##### Check the results with most similar words

**We will compare three models:**

+ emb_dim_100 - the first Word2Vec we trained (no knowledge, embedding size=100)

+ 100_knowledge - we use knowledge (WordNet synonyms), but its weight is very low (lambdda=0.1, m=128)

+ 100_knowledge_importance - the WordNet synonym loss is used with a much higher weight (lambdda=2, m=512)

**The goal is to figure out whether the modification we introduced in this section makes sense.** Keep in mind that I will use subjective criteria to judge it (check the most similar words, visualize the embeddings). However, I will try to assess honestly whether the experiment made sense or not
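
The `most_similar` helper comes from `quality_assessment.py`, which is not shown in this notebook. A minimal sketch of what such a lookup typically does, assuming it ranks all other words by cosine similarity to the query word (the function name `most_similar_sketch` and the toy 2-dimensional embeddings are hypothetical):

```python
import math


def cosine(a, b):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + 1e-8)


def most_similar_sketch(embeddings, word, top_k=5):
    """Return the `top_k` words whose vectors are closest to `word` by cosine."""
    query = embeddings[word]
    scored = [(cosine(query, vec), other)
              for other, vec in embeddings.items() if other != word]
    scored.sort(reverse=True)
    return [other for _, other in scored[:top_k]]


toy_embeddings = {
    'lemon': [1.0, 0.1],
    'lime': [0.9, 0.2],
    'citrus': [0.8, 0.4],
    'mouse': [-0.5, 1.0],
}
neighbours = most_similar_sketch(toy_embeddings, 'lemon', top_k=2)
print(neighbours)
```

The real helper additionally loads a checkpoint by file name and maps between words and indices, which this sketch leaves out.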

Apple

In [20]:
# makes sense; maybe not the type of apple that we want, but keep in mind that we are using random Wikipedia articles for training
# the best out of three
most_similar('emb_dim_100.pt', 'apple')

['ios', 'android', 'microsoft', 'amazon', 'lemon']

In [None]:
# worse
most_similar('100_knowledge.pt', 'apple')

['inc', 'hardware', 'coffee', 'keynote', 'electronics']

In [None]:
# worse
most_similar('100_knowledge_importance.pt', 'apple')

['coconut', 'sugar', 'manufacturing', 'egg', 'nut']

Lemon

In [None]:
# not bad
most_similar('emb_dim_100.pt', 'lemon')

['lime', 'mango', 'sticks', 'citrus', 'mouse']

In [None]:
# better, this and the next one are the best
most_similar('100_knowledge.pt', 'lemon')

['leaf', 'citron', 'lime', 'mandarins', 'grapefruit']

In [None]:
# better, this and the previous one are the best
most_similar('100_knowledge_importance.pt', 'lemon')

['papilio', 'grapefruit', 'orange', 'oranges', 'citron']

University

In [None]:
# good, all three are good
most_similar('emb_dim_100.pt', 'university')

['faculty', 'college', 'institute', 'polytechnic', 'stanford']

In [None]:
# good, all three are good
most_similar('100_knowledge.pt', 'university')

['college', 'institute', 'harvard', 'museum', 'faculty']

In [None]:
# good, all three are good
most_similar('100_knowledge_importance.pt', 'university')

['college', 'stanford', 'polytechnic', 'faculty', 'campus']

Human

In [69]:
most_similar('emb_dim_100.pt', 'human')

['rights', 'social', 'cells', 'animal', 'violence']

In [70]:
most_similar('100_knowledge.pt', 'human')

['prevention', 'rights', 'genetic', 'animal', 'related']

In [None]:
# I think this makes the most sense, but others are good as well
most_similar('100_knowledge_importance.pt', 'human')

['humans', 'environmental', 'humanity', 'economic', 'behavioral']

Bottle

In [None]:
# makes the least sense
most_similar('emb_dim_100.pt', 'bottle')

['wine', 'thread', 'slice', 'set', 'cloth']

In [73]:
most_similar('100_knowledge.pt', 'bottle')

['sheet', 'tree', 'plastic', 'sheets', 'bones']

In [None]:
# makes the most sense
most_similar('100_knowledge_importance.pt', 'bottle')

['wine', 'beats', 'bath', 'bottles', 'glass']

Check some synonym pairs

In [44]:
for pair in synonym_pairs[140:155]:
    print(dataset.convert_idx_to_word(pair[0]), dataset.convert_idx_to_word(pair[1]))

later after
after later
withdraw retreat
retreat withdraw
withdraw retire
retire withdraw
withdraw recall
recall withdraw
withdraw draw
draw withdraw
withdraw remove
remove withdraw
withdraw take
take withdraw
home place


In [47]:
for pair in synonym_pairs[180:190]:
    print(dataset.convert_idx_to_word(pair[0]), dataset.convert_idx_to_word(pair[1]))

arena area
area arena
arena orbit
orbit arena
arena field
field arena
arena stadium
stadium arena
arena bowl
bowl arena


Based on this, let's check the following words: withdraw, arena

Given the synonym pairs above, we expect to see the following neighbours:

withdraw -> remove, take, retreat, retire, recall

arena -> orbit, field, bowl, stadium

In [None]:
# 0 intersections with our expectations
most_similar('emb_dim_100.pt', 'withdraw')

['immediately', 'proceeded', 'illegally', 'resign', 'arrest']

In [None]:
# 3 intersections with our expectations
most_similar('100_knowledge.pt', 'withdraw')

['remove', 'retire', 'call', 'convert', 'retreat']

In [None]:
# 3 intersections with our expectations (remove, retreat, retire), plus the inflected form 'withdrew'
most_similar('100_knowledge_importance.pt', 'withdraw')

['remove', 'withdrew', 'retreat', 'pull', 'retire']

In [None]:
# 0 intersections with our expectations
most_similar('emb_dim_100.pt', 'arena')

['coliseum', 'center', 'steelhawks', 'saturday', 'edt']

In [None]:
# 1 intersection with our expectations
most_similar('100_knowledge.pt', 'arena')

['coliseum', 'indoor', 'stadium', 'raiders', 'steelhawks']

In [None]:
# 2 intersections with our expectations
most_similar('100_knowledge_importance.pt', 'arena')

['stadium', 'arenas', 'field', 'lancers', 'coliseum']
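
The intersection counts in the comments above were tallied by hand. A small helper along the following lines could do the counting instead. This is only a sketch: `count_overlap` is a hypothetical name that does not exist in the notebook, and the lists are copied from the outputs above.

```python
def count_overlap(predicted, expected):
    """Return the expected synonyms that appear among the predicted neighbours."""
    expected_set = set(expected)
    return [word for word in predicted if word in expected_set]

# Expected neighbours, taken from the WordNet synonym pairs printed above
expected_arena = ['orbit', 'field', 'bowl', 'stadium']

# Top-5 neighbours copied from the three outputs for 'arena'
neighbours = {
    'emb_dim_100.pt': ['coliseum', 'center', 'steelhawks', 'saturday', 'edt'],
    '100_knowledge.pt': ['coliseum', 'indoor', 'stadium', 'raiders', 'steelhawks'],
    '100_knowledge_importance.pt': ['stadium', 'arenas', 'field', 'lancers', 'coliseum'],
}

for model_name, predicted in neighbours.items():
    hits = count_overlap(predicted, expected_arena)
    print(f"{model_name}: {len(hits)} hits {hits}")
```

Counting programmatically avoids off-by-one slips when the comparison is repeated for many query words.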

##### Draw conclusions based on the most similar words analysis:

+ The new approach and the new loss function work! The knowledge-augmented Word2Vec models clearly shift their embeddings toward the WordNet synonym pairs: for "withdraw" and "arena", the baseline recovers none of the expected neighbours in its top 5, while both knowledge models recover at least one (see the last experiment)

+ For arbitrary words (not the ones we specifically picked using the WordNet dataset), I think the new Word2Vec model (with external knowledge) makes, on average, only slightly more sense. Still, I have no doubt that the approach itself works

+ The quality of the WordNet dataset, however, is questionable (for example, "arena" is paired with "orbit" and "bowl"). I do not think it is very useful for our task. Hence, we should not rely on it much and should limit its influence (this can be easily done with the hyperparameters we introduced, m and lambda)

+ Far more important is having more data and more time to train (many more Wiki articles, more training epochs, maybe some experiments with the architecture)

**Let's visualize the embeddings**

Inspired by: https://github.com/ashaba1in/hse-nlp/blob/main/2023/seminars/week1_word_embeddings.ipynb (NLP course from my undergrad)

In [21]:
import bokeh.models as bm
import bokeh.plotting as pl
from bokeh.io import output_notebook
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from quality_assessment import visualize_embeddings_interactive

# Ensure Bokeh output is set to notebook if using Jupyter
output_notebook()

In [None]:
# model without the external knowledge
# both this model and the next one are good and make sense. I will summarize the results of the next model in the cell below
visualize_embeddings_interactive(model_name='emb_dim_100.pt', num_words=1000, seed=42, device=device,
                                    radius=10, alpha=0.5, color='blue',
                                    width=1000, height=800, show=True)

In [None]:
# model with the external knowledge
visualize_embeddings_interactive(model_name='100_knowledge_importance.pt', num_words=1000, seed=42, device=device,
                                    radius=10, alpha=0.5, color='blue',
                                    width=1000, height=800, show=True)

##### Analysis of the visual results

**I will summarize what I saw in the plot of the knowledge model. Most of these observations also apply to the model without the external knowledge, although the model with the knowledge probably makes slightly more sense.**

+ Locations (like the US states) are close to each other, and so are professions. Some unusual words (perhaps from the same non-English languages) are also close to each other

+ Words in the past tense are close to each other (like brought, agreed, etc.)

+ Some specific terminology (like bridge, reservoir, mine, gate, which is probably related to some factory or construction) is close to each other

+ Countries (Cuba, Iceland, Mongolia, Arabia, etc.) are close to each other but far from the states I mentioned before

+ Overall, the plot makes sense in many cases. Obviously, I did not summarize everything; these are just some examples. I think the model is good, especially given the data quality and quantity and the resource constraints.

# 2 Pretrained Model Embedding Generation

In [71]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("openbmb/MiniCPM-1B-sft-bf16", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("openbmb/MiniCPM-1B-sft-bf16", trust_remote_code=True)


**We will extract embeddings from the input embedding matrix**

In [72]:
all_embeddings = model.get_input_embeddings()
words = ["apple", "appple", "chair", "boy", "peach"]
embedding_reflection = {}
for word in words:
    tokens = tokenizer.encode(word, add_special_tokens=False)
    word_embeddings = []
    for token in tokens:
        word_embedding = all_embeddings(torch.tensor(token))
        word_embeddings.append(word_embedding)
    word_embeddings = torch.stack(word_embeddings)
    print(word_embeddings.shape)
    embedding_reflection[word] = word_embeddings

torch.Size([1, 1536])
torch.Size([2, 1536])
torch.Size([1, 1536])
torch.Size([1, 1536])
torch.Size([2, 1536])


In [73]:
for word1, embedding1 in embedding_reflection.items():
    for word2, embedding2 in embedding_reflection.items():

        # if a word is split into several tokens, we simply average their embeddings
        avg_embedding1 = torch.mean(embedding1, dim=0, keepdim=True)
        avg_embedding2 = torch.mean(embedding2, dim=0, keepdim=True)

        cosine_sim = F.cosine_similarity(avg_embedding1, avg_embedding2)
        print(f"{word1} {word2} = {cosine_sim.item()}")

apple apple = 1.000000238418579
apple appple = 0.6285947561264038
apple chair = 0.6110920906066895
apple boy = 0.6109570860862732
apple peach = 0.6368356347084045
appple apple = 0.6285947561264038
appple appple = 1.0000003576278687
appple chair = 0.5873134136199951
appple boy = 0.5639569163322449
appple peach = 0.6722849011421204
chair apple = 0.6110920906066895
chair appple = 0.5873134136199951
chair chair = 1.0000001192092896
chair boy = 0.5883687734603882
chair peach = 0.5883920788764954
boy apple = 0.6109570860862732
boy appple = 0.5639569163322449
boy chair = 0.5883687734603882
boy boy = 1.000000238418579
boy peach = 0.6021623611450195
peach apple = 0.6368356347084045
peach appple = 0.6722849011421204
peach chair = 0.5883920788764954
peach boy = 0.6021623611450195
peach peach = 1.000000238418579
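
The pooling step used above (average the token vectors of a word, then compare words by cosine similarity) can be illustrated without loading the model. This is a minimal sketch, assuming made-up 3-dimensional vectors in place of the real 1536-dimensional rows; `mean_pool` and `cosine` are illustrative helpers, not functions from the notebook.

```python
import math

def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one word vector."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors (made up): one word maps to a single token, the other to two tokens
single_token_word = [[1.0, 0.0, 1.0]]
two_token_word = [[1.0, 1.0, 0.0], [1.0, -1.0, 2.0]]

pooled = mean_pool(two_token_word)  # component-wise mean of the two token vectors
print(pooled)
print(cosine(mean_pool(single_token_word), pooled))
```

Note that two different token vectors can average to the same point as an unrelated single token, which is one reason mean pooling loses information.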


**Seems like this is all I needed to do for this part of the assignment**

# 3 Evaluation

### 3.1 Word Similarity Task:

Use the WordSim353 dataset, which contains 353 word pairs with
human-annotated similarity scores. Compute the Spearman’s rank correlation
coefficient for the Word2Vec and Small Language Model embeddings

In our case, WordSim353 was filtered (since both words of a pair must occur in our vocabulary), which leaves 304 word pairs in total. We calculated the Spearman’s rank correlation coefficient for each of the Word2Vec models right after we trained them (please refer to section 1.1 to see the details). In summary:

Spearman’s correlation coefficient for the best model (embedding_dim=100) without the external knowledge: 0.4319

Spearman’s correlation coefficient for the best model (embedding_dim=100) with the external knowledge: 0.3912
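
For readers who want to see what the metric does, here is a minimal pure-Python sketch of the two steps described above: drop pairs with out-of-vocabulary words, then compute Spearman's correlation as the Pearson correlation of the ranks. The human scores are copied from the WordSim table shown in this section; the toy vocabulary and the model scores are made up, and the helper names are hypothetical (the notebook itself uses `spearmanr`).

```python
def average_ranks(values):
    """Rank values from 1 to n; tied values get the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy vocabulary (an assumption for this example); 'aluminum' and 'metal' are missing
vocab_toy = {'wood', 'forest', 'weather', 'forecast', 'wednesday', 'news'}
pairs = [('wood', 'forest', 7.9375), ('weather', 'forecast', 5.4375),
         ('wednesday', 'news', 1.1250), ('aluminum', 'metal', 6.6250)]

# Keep only pairs where both words are in the vocabulary
kept = [p for p in pairs if p[0] in vocab_toy and p[1] in vocab_toy]
human = [p[2] for p in kept]
model_scores = [0.61, 0.70, 0.12]  # made-up cosine similarities for the kept pairs

print(len(kept), spearman(model_scores, human))
```

Because only the ranks matter, the metric does not care that cosine similarities and human scores live on different scales.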

Let's see how the model from huggingface will perform

In [27]:
wordsim

Unnamed: 0,Word 1,Word 2,Human (Mean)
0,admission,ticket,5.5360
1,alcohol,chemistry,4.1250
2,aluminum,metal,6.6250
3,announcement,effort,2.0625
4,announcement,news,7.1875
...,...,...,...
299,weapon,secret,2.5000
300,weather,forecast,5.4375
301,wednesday,news,1.1250
302,wood,forest,7.9375


In [33]:
def get_embedding(word, tokenizer=tokenizer):
    tokens = tokenizer.encode(word, add_special_tokens=False)
    word_embeddings = []
    for token in tokens:
        word_embedding = all_embeddings(torch.tensor(token))
        word_embeddings.append(word_embedding)
    word_embeddings = torch.stack(word_embeddings)
    word_embedding = torch.mean(word_embeddings, dim=0, keepdim=True)
    return word_embedding

In [56]:
apple_embedding = get_embedding('apple')
apple_embedding.shape

torch.Size([1, 1536])

In [34]:
def compute_correlation(df):
  cosine_similarities = []
  human_scores = []
  for index, row in df.iterrows():
      word1 = row['Word 1']
      word2 = row['Word 2']
      human_score = row['Human (Mean)']

      # Get embeddings for Word 1 and Word 2
      embedding1 = get_embedding(word1)
      embedding2 = get_embedding(word2)
      # Compute cosine similarity
      similarity = F.cosine_similarity(embedding1, embedding2).item()
      cosine_similarities.append(similarity)
      human_scores.append(human_score)

  # Compute Spearman’s correlation coefficient
  correlation, _ = spearmanr(cosine_similarities, human_scores)
  print(f"Spearman’s correlation coefficient: {correlation:.4f}")
  return correlation

In [None]:
# Metric we need to beat: 0.4935
compute_correlation(wordsim)

Spearman’s correlation coefficient: 0.6147


0.6147075607558922

##### Analysis:

Obviously, this model is better than the one we introduced in section 1.2 (0.6147 vs 0.4935 in terms of correlation). The reasons seem quite clear:

+ Embedding dimension: 1536 vs 100

+ More data (it is reasonable to assume the authors of the model we loaded used more data)

+ More training epochs, more complex architecture, more time, and more resources

+ Different model architecture. Word2Vec was one of the earliest widely used neural embedding methods, and architectures of such models have changed a lot since then (the loaded model is a Transformer-based language model)

### 3.2 Paraphrase Detection Task: In this task, you will use the embeddings

Use either word or sentence embeddings to predict paraphrases. The task
involves identifying whether a pair of sentences are paraphrases of each
other. After obtaining the similarity score between the sentences, classify
the pair as paraphrases if the score is greater than or equal to a predefined
threshold (which should be fixed for the task), or as non-paraphrases if the
score is below the threshold. Report the threshold and accuracy of each
model on this task. Use the dataset msr_paraphrase_test.txt

##### Discuss the plan

+ Even though the simplest way is to assign an arbitrary threshold for this task, it seems better to choose it in a smarter way

+ Since we have both the train set and the test set, we can pick the best candidate threshold by maximizing the accuracy on the train set

+ We will use the best of our Skip-Gram models to set the threshold

+ After that, we will keep it fixed and run both the Skip-Gram model and the pre-trained LM on the test set

+ We will look at the accuracy and at the confusion matrix

+ This will be enough for the evaluation

+ To turn the Skip-Gram model into a sentence embedding model, we will just take the average of the word embeddings

+ As for the pre-trained model, the get_embedding function we already defined will work
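
The plan above (pick the threshold on the train set, then keep it fixed for the test set) can be sketched in a few lines. This is an illustration under stated assumptions: the similarity scores and labels below are made up, and `accuracy_at` and `pick_threshold` are hypothetical helper names, not functions from the notebook.

```python
def accuracy_at(threshold, similarities, labels):
    """Accuracy when pairs with similarity >= threshold are labelled 1 (paraphrase)."""
    predictions = [1 if s >= threshold else 0 for s in similarities]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def pick_threshold(candidates, similarities, labels):
    """Return the candidate with the highest accuracy (the first one wins ties)."""
    return max(candidates, key=lambda t: accuracy_at(t, similarities, labels))

# Made-up similarity scores and gold labels, for illustration only
train_sims = [0.95, 0.84, 0.93, 0.72, 0.85, 0.79]
train_labels = [1, 0, 1, 0, 1, 0]
test_sims = [0.91, 0.70, 0.88]
test_labels = [1, 0, 0]

candidates = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
best = pick_threshold(candidates, train_sims, train_labels)  # chosen on train only
print(best, accuracy_at(best, test_sims, test_labels))
```

Tuning on the train set and reporting on the test set keeps the reported accuracy from being inflated by the threshold search itself.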


I will use some code from here: https://www.kaggle.com/code/armandogru/va-a2 as a reference

In [35]:
import re
def clean_text(text):
    text = str(text).lower().strip()
    # Remove punctuation (you can adjust the regex if needed)
    text = re.sub(r"[^\w\s]", "", text)
    # Normalize whitespace
    text = re.sub(r"\s+", " ", text)
    return text

In [36]:
paraphrase_train = pd.read_csv('msr_paraphrase_train.txt', sep='\t', header=0, on_bad_lines='skip')
paraphrase_train = paraphrase_train.loc[paraphrase_train['#1 String'].notna() & paraphrase_train['#2 String'].notna()]
paraphrase_train.index = np.arange(0, len(paraphrase_train))
paraphrase_train

Unnamed: 0,Quality,#1 ID,#2 ID,#1 String,#2 String
0,1,702876,702977,"Amrozi accused his brother, whom he called ""th...","Referring to him as only ""the witness"", Amrozi..."
1,0,2108705,2108831,Yucaipa owned Dominick's before selling the ch...,Yucaipa bought Dominick's in 1995 for $693 mil...
2,1,1330381,1330521,They had published an advertisement on the Int...,"On June 10, the ship's owners had published an..."
3,0,3344667,3344648,"Around 0335 GMT, Tab shares were up 19 cents, ...","Tab shares jumped 20 cents, or 4.6%, to set a ..."
4,1,1236820,1236712,"The stock rose $2.11, or about 11 percent, to ...",PG&E Corp. shares jumped $1.63 or 8 percent to...
...,...,...,...,...,...
3912,1,1620264,1620507,"At this point, Mr. Brando announced: 'Somebody...","Brando said that ""somebody ought to put a bull..."
3913,0,1848001,1848224,"Martin, 58, will be freed today after serving ...",Martin served two thirds of a five-year senten...
3914,1,747160,747144,We have concluded that the outlook for price s...,"In a statement, the ECB said the outlook for p..."
3915,1,2539933,2539850,The notification was first reported Friday by ...,MSNBC.com first reported the CIA request on Fr...


In [37]:
paraphrase_train.Quality.value_counts() # so there are actually more paraphrases than non-paraphrases

1    2646
0    1271
Name: Quality, dtype: int64

In [38]:
paraphrase_test = pd.read_csv('msr_paraphrase_test.txt', sep='\t', header=0, on_bad_lines='skip')
paraphrase_test = paraphrase_test.loc[paraphrase_test['#1 String'].notna() & paraphrase_test['#2 String'].notna()]
paraphrase_test.index = np.arange(0, len(paraphrase_test))
paraphrase_test

Unnamed: 0,Quality,#1 ID,#2 ID,#1 String,#2 String
0,1,1089874,1089925,"PCCW's chief operating officer, Mike Butcher, ...",Current Chief Operating Officer Mike Butcher a...
1,1,3019446,3019327,The world's two largest automakers said their ...,Domestic sales at both GM and No. 2 Ford Motor...
2,1,1945605,1945824,According to the federal Centers for Disease C...,The Centers for Disease Control and Prevention...
3,0,1430402,1430329,A tropical storm rapidly developed in the Gulf...,A tropical storm rapidly developed in the Gulf...
4,0,3354381,3354396,The company didn't detail the costs of the rep...,But company officials expect the costs of the ...
...,...,...,...,...,...
1625,0,2685984,2686122,"After Hughes refused to rehire Hernandez, he c...",Hernandez filed an Equal Employment Opportunit...
1626,0,339215,339172,There are 103 Democrats in the Assembly and 47...,Democrats dominate the Assembly while Republic...
1627,0,2996850,2996734,Bethany Hamilton remained in stable condition ...,"Bethany, who remained in stable condition afte..."
1628,1,2095781,2095812,"Last week the power station’s US owners, AES C...","The news comes after Drax's American owner, AE..."


In [39]:
paraphrase_train['#1 String'] = paraphrase_train['#1 String'].apply(clean_text)
paraphrase_train['#2 String'] = paraphrase_train['#2 String'].apply(lambda x: clean_text(x) if pd.notna(x) else "")

paraphrase_test['#1 String'] = paraphrase_test['#1 String'].apply(clean_text)
paraphrase_test['#2 String'] = paraphrase_test['#2 String'].apply(lambda x: clean_text(x) if pd.notna(x) else "")

In [None]:
save_path = os.path.join('model_checkpoints', '100_knowledge_importance.pt')
checkpoint = torch.load(save_path, map_location=device)
vocab_size = len(vocab)
idx_to_word = dataset.convert_idx_to_word
word_to_idx = dataset.convert_word_to_idx
model_100_knowledge_importance = SkipGramModel(vocab_size, embedding_dim=100).to(device)
model_100_knowledge_importance.load_state_dict(checkpoint)
model_100_knowledge_importance.eval()

In [60]:
def convert_sentence_to_embedding(sentence, model=model_100_knowledge_importance):
    lst_sentence = sentence.split()
    lst_sentence = [word_to_idx(word) for word in lst_sentence if word in vocab]
    embeddings = model.target_embeddings.weight
    sampled_embeddings = embeddings[lst_sentence].cpu().detach().numpy()
    sentence_embedding = np.mean(sampled_embeddings, axis=0)
    return sentence_embedding
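
One edge case worth keeping in mind: if none of the words of a sentence are in the vocabulary, the list of indices is empty and the mean of an empty array is NaN. The sketch below shows one possible guard on toy data. The zero-vector fallback is an assumption of this example (skipping such pairs would be another option), and `sentence_vector` and `toy_vectors` are hypothetical names.

```python
def sentence_vector(sentence, word_vectors, dim):
    """Mean of the vectors of in-vocabulary words; a zero vector if none are known."""
    known = [word_vectors[w] for w in sentence.split() if w in word_vectors]
    if not known:
        return [0.0] * dim  # explicit fallback instead of a NaN mean
    return [sum(column) / len(known) for column in zip(*known)]

# Toy 2-dimensional embeddings (made up)
toy_vectors = {'stock': [1.0, 3.0], 'rose': [3.0, 1.0]}

print(sentence_vector('the stock rose', toy_vectors, dim=2))  # 'the' is skipped
print(sentence_vector('zzz qqq', toy_vectors, dim=2))         # no known words
```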

In [74]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import accuracy_score, confusion_matrix
def similar_or_not(df, threshold, pretrained_model=False):
    # Note: df is modified in place (two columns are added or overwritten)
    df['Model_answer'] = -1
    df['Similarity'] = -1.0  # float placeholder, so float scores fit the column dtype
    for index, row in df.iterrows():
        sentence1 = row['#1 String']
        sentence2 = row['#2 String']
        if pretrained_model:
            # mean of the LM's input token embeddings over the whole sentence
            embedding1 = get_embedding(sentence1, tokenizer=tokenizer)
            embedding2 = get_embedding(sentence2, tokenizer=tokenizer)
            similarity = F.cosine_similarity(embedding1, embedding2).item()
        else:
            # mean of the Skip-Gram word embeddings
            embedding1 = convert_sentence_to_embedding(sentence1)
            embedding2 = convert_sentence_to_embedding(sentence2)
            similarity = float(cosine_similarity([embedding1], [embedding2])[0][0])  # convert to Python float
        df.at[index, 'Similarity'] = similarity
        # classify as a paraphrase if the score reaches the threshold
        df.at[index, 'Model_answer'] = 1 if similarity >= threshold else 0
    return df, accuracy_score(df['Quality'], df['Model_answer'])

In [62]:
# the best turned out to be 0.8
thresholds = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
accuracies = []
for threshold in thresholds:
    print(f"Threshold: {threshold}")
    _, acc_train = similar_or_not(paraphrase_train, threshold)
    print('Accuracy:', acc_train)
    accuracies.append(acc_train)

Threshold: 0.3
Accuracy: 0.6755169772785294
Threshold: 0.4
Accuracy: 0.6755169772785294
Threshold: 0.5
Accuracy: 0.6757722747000255
Threshold: 0.6
Accuracy: 0.6783252489149859
Threshold: 0.7
Accuracy: 0.6880265509318356
Threshold: 0.8
Accuracy: 0.7018126116926219
Threshold: 0.9
Accuracy: 0.6622415113607353


We see that thresholds 0.3 and 0.4 give exactly the same accuracy, which is suspicious. The hypothesis is that no pair has a similarity below 0.4, so both thresholds label every pair as a paraphrase. Let's check the hypothesis
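
A quick sanity check supports this reading: the accuracy of a classifier that labels every pair as a paraphrase can be computed directly from the class counts printed by `value_counts()` earlier. The variable names below are only illustrative.

```python
# Class counts copied from the value_counts() output on the train set
n_paraphrase = 2646
n_non_paraphrase = 1271

# Accuracy of a classifier that labels every pair as a paraphrase
all_positive_accuracy = n_paraphrase / (n_paraphrase + n_non_paraphrase)
print(round(all_positive_accuracy, 4))
```

This baseline matches the accuracy reported for the low thresholds, so any useful threshold has to beat it.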

In [63]:
df_analysis, acc_train = similar_or_not(paraphrase_train, 0.8)
df_analysis

Unnamed: 0,Quality,#1 ID,#2 ID,#1 String,#2 String,Model_answer,Similarity
0,1,702876,702977,amrozi accused his brother whom he called the ...,referring to him as only the witness amrozi ac...,1,0.951396
1,0,2108705,2108831,yucaipa owned dominicks before selling the cha...,yucaipa bought dominicks in 1995 for 693 milli...,1,0.840388
2,1,1330381,1330521,they had published an advertisement on the int...,on june 10 the ships owners had published an a...,1,0.926847
3,0,3344667,3344648,around 0335 gmt tab shares were up 19 cents or...,tab shares jumped 20 cents or 46 to set a reco...,1,0.886066
4,1,1236820,1236712,the stock rose 211 or about 11 percent to clos...,pge corp shares jumped 163 or 8 percent to 210...,1,0.849477
...,...,...,...,...,...,...,...
3912,1,1620264,1620507,at this point mr brando announced somebody oug...,brando said that somebody ought to put a bulle...,1,0.910437
3913,0,1848001,1848224,martin 58 will be freed today after serving tw...,martin served two thirds of a fiveyear sentenc...,1,0.899965
3914,1,747160,747144,we have concluded that the outlook for price s...,in a statement the ecb said the outlook for pr...,1,0.964827
3915,1,2539933,2539850,the notification was first reported friday by ...,msnbccom first reported the cia request on friday,0,0.799303


In [66]:
# Indeed, none of the points
df_analysis.loc[df_analysis['Similarity'] <= 0.4]

Unnamed: 0,Quality,#1 ID,#2 ID,#1 String,#2 String,Model_answer,Similarity


##### Time to compare the results between two models

**We set the threshold to 0.8 for both models. This might not be a very good idea, since the threshold was tuned on the model we trained. Yet, this is what the task asks for**

In [68]:
df_test_our_model, acc_test_our_model = similar_or_not(paraphrase_test, 0.8)
print('Accuracy (Skip-Gram):', acc_test_our_model)

Accuracy (Skip-Gram): 0.7049079754601227


In [75]:
df_test_pretained_model, acc_test_pretained_model = similar_or_not(paraphrase_test, 0.8, pretrained_model=True)
print('Accuracy (pre-trained LM):', acc_test_pretained_model)

Accuracy (pre-trained LM): 0.6638036809815951


**For threshold=0.8, our model is slightly better than the pre-trained one: the accuracies are 0.705 and 0.664, respectively. The pre-trained model basically assigned 1 to every observation. However, let's see the best possible behavior of the pre-trained model.**
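
A confusion matrix makes the "assigned 1 to every observation" behavior easy to see: all negatives end up as false positives and there are no predicted negatives at all. Below is a minimal pure-Python sketch with made-up labels; `confusion_counts` is a hypothetical helper, not part of the notebook.

```python
def confusion_counts(labels, predictions):
    """Count (tn, fp, fn, tp) for binary labels where 1 means 'paraphrase'."""
    tn = fp = fn = tp = 0
    for y, p in zip(labels, predictions):
        if y == 1 and p == 1:
            tp += 1
        elif y == 1 and p == 0:
            fn += 1
        elif y == 0 and p == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp

# Made-up gold labels; the predictions mimic a model that answers 1 for every pair
labels = [1, 0, 1, 0, 1]
always_one = [1] * len(labels)

tn, fp, fn, tp = confusion_counts(labels, always_one)
print(tn, fp, fn, tp)
```

With the dataframes from this notebook, the already imported `confusion_matrix(df['Quality'], df['Model_answer'])` from scikit-learn should give the same four counts as a 2x2 array.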

In [76]:
# check for the pre-trained model
# accuracy is almost the same for every threshold; 0.9 is marginally the best
thresholds = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
accuracies = []
for threshold in thresholds:
    print(f"Threshold: {threshold}")
    _, acc_train = similar_or_not(paraphrase_train, threshold, pretrained_model=True)
    print('Accuracy:', acc_train)
    accuracies.append(acc_train)

Threshold: 0.3
Accuracy: 0.6755169772785294
Threshold: 0.4
Accuracy: 0.6755169772785294
Threshold: 0.5
Accuracy: 0.6755169772785294
Threshold: 0.6
Accuracy: 0.6755169772785294
Threshold: 0.7
Accuracy: 0.6755169772785294
Threshold: 0.8
Accuracy: 0.6755169772785294
Threshold: 0.9
Accuracy: 0.6757722747000255


This looks very odd: the accuracy is almost identical at every threshold, which suggests that nearly all the similarities are very high. Let's check

In [77]:
df_analysis_pretrained, acc_train = similar_or_not(paraphrase_train, 0.8, pretrained_model=True)
df_analysis_pretrained

Unnamed: 0,Quality,#1 ID,#2 ID,#1 String,#2 String,Model_answer,Similarity
0,1,702876,702977,amrozi accused his brother whom he called the ...,referring to him as only the witness amrozi ac...,1,0.993158
1,0,2108705,2108831,yucaipa owned dominicks before selling the cha...,yucaipa bought dominicks in 1995 for 693 milli...,1,0.989980
2,1,1330381,1330521,they had published an advertisement on the int...,on june 10 the ships owners had published an a...,1,0.993228
3,0,3344667,3344648,around 0335 gmt tab shares were up 19 cents or...,tab shares jumped 20 cents or 46 to set a reco...,1,0.991016
4,1,1236820,1236712,the stock rose 211 or about 11 percent to clos...,pge corp shares jumped 163 or 8 percent to 210...,1,0.986909
...,...,...,...,...,...,...,...
3912,1,1620264,1620507,at this point mr brando announced somebody oug...,brando said that somebody ought to put a bulle...,1,0.989343
3913,0,1848001,1848224,martin 58 will be freed today after serving tw...,martin served two thirds of a fiveyear sentenc...,1,0.985217
3914,1,747160,747144,we have concluded that the outlook for price s...,in a statement the ecb said the outlook for pr...,1,0.991138
3915,1,2539933,2539850,the notification was first reported friday by ...,msnbccom first reported the cia request on friday,1,0.980402


In [None]:
# indeed, all the similarities are very high
df_analysis_pretrained.loc[df_analysis_pretrained['Similarity'] <= 0.9]

Unnamed: 0,Quality,#1 ID,#2 ID,#1 String,#2 String,Model_answer,Similarity
2532,0,2662158,2662046,the standard edition is 15000 per processor or...,the standard edition one is a single processor...,1,0.891606


##### Make sure we did not make any mistakes

**Compare my get_embedding function with the results from the reference_assignment1 notebook (made by the TA)**

In [None]:
# same
F.cosine_similarity(get_embedding('apple'), get_embedding('appple'))

tensor([0.6286], grad_fn=<SumBackward1>)

In [None]:
# same
F.cosine_similarity(get_embedding('apple'), get_embedding('chair'))

tensor([0.6111], grad_fn=<SumBackward1>)

**Make sure the values checked below match the results in the dataframe**

In [None]:
# same
F.cosine_similarity(get_embedding(df_analysis_pretrained['#1 String'][0]), get_embedding(df_analysis_pretrained['#2 String'][0]))

tensor([0.9932], grad_fn=<SumBackward1>)

In [None]:
# same
F.cosine_similarity(get_embedding(df_analysis_pretrained['#1 String'][3]), get_embedding(df_analysis_pretrained['#2 String'][3]))

tensor([0.9910], grad_fn=<SumBackward1>)

**The get_embedding function returns an embedding both for a single word and for a whole sentence, since it tokenizes any input with the model's own tokenizer and averages the token embeddings. Hence, there seems to be no mistake in the code I provided, and the averaged embeddings of all the sentences really are very close to each other. One plausible explanation: even unrelated single words score about 0.6 here (e.g., apple vs chair), and averaging over the many tokens of a sentence smooths out most of what distinguishes one sentence from another. This is probably a limitation of using the model this way, since it was probably not designed for such a use case. A better approach might be to split a sentence into words, convert each word to an embedding, and then take the average. However, we will not do this here. We conclude that our model performed better than the pre-trained one on this particular task: at threshold=0.8 the pre-trained model assigned label 1 to every pair of sentences, and the accuracy of our model was higher.**

In [94]:
F.cosine_similarity(get_embedding('I am doing an NLP assignment'), get_embedding('somebody once told me the world is gonna roll me'))

tensor([0.9026], grad_fn=<SumBackward1>)
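
Why would two unrelated sentences score above 0.9? One possible mechanism can be shown with a toy model, under the loud assumption that every token vector is a shared component plus independent noise. This is synthetic data, not an analysis of the actual embedding matrix, and all names below are hypothetical.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_of_noisy_vectors(common, n_vectors, rng):
    """Average n vectors, each equal to `common` plus independent Gaussian noise."""
    total = [0.0] * len(common)
    for _ in range(n_vectors):
        for d in range(len(common)):
            total[d] += common[d] + rng.gauss(0.0, 1.0)
    return [t / n_vectors for t in total]

rng = random.Random(0)
common = [1.0] * 64  # shared direction (an assumption of this toy model)

# "Words": one token each. "Sentences": 30 tokens each, with no tokens shared.
sim_single = cosine(mean_of_noisy_vectors(common, 1, rng),
                    mean_of_noisy_vectors(common, 1, rng))
sim_pooled = cosine(mean_of_noisy_vectors(common, 30, rng),
                    mean_of_noisy_vectors(common, 30, rng))
print(sim_single, sim_pooled)
```

In this toy setting the noise averages out while the shared component stays, so longer inputs look more alike regardless of their content.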

# 4 Discussion and Analysis

In this section, compare the performance of the Word2Vec embeddings with
those generated by the small language model. Analyze the strengths and weaknesses of each approach. Discuss any challenges encountered during implementation and evaluation.

**Basically, I have already written everything needed for this section in my comments on the other sections. Hence, I will briefly restate the important details here:**

On the WordSim353 dataset, the small language model performed better than the best Skip-Gram model we trained (embedding_dimension=100, with significant influence of external knowledge obtained from the WordNet dataset). The results (Spearman correlation) are 0.6147 and 0.4935, respectively. We summarized the reasons as follows:

+ Embedding dimension: 1536 vs 100

+ More data (it is reasonable to assume the authors of the model we loaded used more data)

+ More training epochs, more complex architecture, more time, and more resources

+ Different model architecture. Word2Vec was one of the earliest widely used neural embedding methods, and architectures of such models have changed a lot since then (the loaded model is a Transformer-based language model)

On the other hand, our Skip-Gram implementation performed better on the other task, paraphrase detection. While both our model and the pre-trained one assign very high similarities (cosine similarity between the averaged embeddings of the two sentences) to sentence pairs, the pre-trained model makes them too high. As a result, for any reasonable threshold its predictions are constant: according to the model, every pair of sentences is a paraphrase. Meanwhile, the Skip-Gram model correctly predicts some negative samples, which makes it the better model here.

We suspected a mistake in the code for this task, resulting in very similar embeddings. However, after testing this hypothesis we did not find any code mistakes, so the high similarities appear to be a property of the averaged embeddings themselves. The accuracies on the test set for the Skip-Gram model and the pre-trained LLM were approximately 0.705 and 0.664, respectively.

##### Strengths of the custom Word2Vec implementation:

**Domain-Specific customization**: Trained on a specific corpus, the embeddings reflect domain-specific semantics (e.g., medical, legal, or technical jargon). This is critical if the task involves specialized vocabulary or niche language patterns.

**No dependency on external models and sources**: Full control over preprocessing, training data, and hyperparameters (e.g., window size, embedding dimensions).

##### Weaknesses of the custom Word2Vec implementation:

**Context insensitivity**: Assigns a single vector per word, ignoring context (e.g., "bank" as a financial institution vs. a riverbank). This limits performance on tasks requiring nuanced meaning.

**Data, resource, and time requirements**: Requires a large, high-quality corpus to produce meaningful embeddings. Performance degrades with small or noisy datasets. Moreover, significant computational resources are ideally needed, as Kaggle and Colab might not be enough to train a high-quality model.

**Limited generalization**: Struggles with rare or out-of-vocabulary (OOV) words. Subword information is not captured (unlike FastText, which uses character n-grams, or models that rely on subword tokenization).
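
To make the subword point concrete, here is a small sketch of character n-grams in the spirit of FastText. It is only an illustration: the real FastText uses a range of n-gram lengths and hashing, and `char_ngrams` is a hypothetical helper name.

```python
def char_ngrams(word, n=3):
    """Character n-grams of a word wrapped in the boundary markers '<' and '>'."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]

seen = set(char_ngrams('apple'))
unseen = char_ngrams('appple')  # a misspelling that a word-level vocabulary would miss
shared = [gram for gram in unseen if gram in seen]

print(char_ngrams('apple'))
print(shared)
```

A word-level Word2Vec vocabulary has no entry for the misspelled word, whereas a subword model can still build a vector for it from the n-grams it shares with the known word.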

##### Strengths of using a pre-trained small LLM:

**Contextual embeddings**: LLMs generate dynamic, context-aware vectors (e.g., "bank" in "river bank" vs. "investment bank" has distinct representations). This is critical for tasks like disambiguation, question answering, or sentiment analysis.

**Transfer learning benefits**: Pre-trained models capture rich linguistic patterns (syntax, semantics, and even world knowledge) from vast datasets. Even smaller LLMs outperform Word2Vec on many NLP benchmarks.

**Subword tokenization**: Handles OOV words via subword units (e.g., BERT’s WordPiece), improving generalization to rare or misspelled words.

##### Weaknesses of the pre-trained small LLM:

**Computational overhead**: Inference requires more resources (GPU/CPU and memory), especially for longer texts. Latency may be higher compared to Word2Vec.

**Domain misalignment**: Pre-trained embeddings may not align with domain-specific terminology unless fine-tuned. For example, a general-purpose LLM might poorly represent technical terms in biomedicine.

**Black-Box Nature**: Less interpretable than Word2Vec, making debugging or customization harder.


**Please note that here I am discussing the advantages and limitations of these kinds of models in general terms, not this homework's particular case (which we already discussed in detail).**

**All the challenges I faced were discussed during the implementation. Mainly they were related to choosing hyperparameters and the parameters of model architectures (how we can make a model both good and not requiring a lot of time and resources).**