## Task 1

### 1. Preparing the data

Download the NLTK Reuters dataset.

In [1]:
# import required libraries
import nltk 
import torch
import random
import numpy as np
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt

In [2]:
# download the reuters dataset
nltk.download('reuters')
# download stopwords 
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package reuters to
[nltk_data]     /Users/kaungheinhtet/nltk_data...
[nltk_data]   Package reuters is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/kaungheinhtet/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/kaungheinhtet/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/kaungheinhtet/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [3]:
# import categories news datasets 
from nltk.corpus import reuters 
# tokenizer
from nltk.tokenize import word_tokenize
# stopwords 
from nltk.corpus import stopwords 
# string
import string

In [4]:
# list of document 

doc_ids = reuters.fileids()
print("# of ids in reuters", len(doc_ids))

# of ids in reuters 10788


In [5]:
type(doc_ids)

list

In [6]:
# print the first document
reuters.raw(doc_ids[0])



In [7]:
# stop words and punctuation characters
stop_words = set(stopwords.words('english'))
punctuation = set(string.punctuation)

In [8]:
print(stop_words)

{'no', 'own', 'where', 'yourselves', 'for', "i'm", 'here', "should've", 'aren', 'her', 'few', 'theirs', 'before', 'to', "you've", 'your', 'did', 'who', 'over', 'on', "we'll", 'he', "it'd", "i've", 'him', 'further', 'itself', "you'll", 'as', 's', 'can', 'if', "i'll", 'up', 'hadn', "they're", "wouldn't", 'very', "we've", 'my', 'have', 'nor', 'them', 'himself', "they've", 'wouldn', 'because', 't', 'haven', 'in', 'into', 'should', 'yourself', "haven't", 'what', "needn't", "couldn't", "isn't", 'more', 'shan', 'they', "she'd", 'only', 'won', 'we', 'was', 'most', 'had', 'those', 'too', 'ain', 'down', 'or', 'such', "won't", 'but', 'not', 'there', 'each', 'does', "it's", "he's", 'these', 'by', 'which', "mustn't", "we'd", 'doing', 'of', "we're", 'o', 'are', 'll', 'his', 'through', 'after', "shan't", "they'd", "i'd", 'herself', 'while', 'weren', "wasn't", 'didn', 'be', 'with', 'that', 'ourselves', 'hers', 'so', 've', 'any', 'hasn', 'than', 'when', 'again', 'this', 'at', "he'd", 'then', 'how', 'fr

In [9]:
print(punctuation)

{'}', "'", '`', '@', '+', '^', '\\', ':', '#', '!', '?', '$', '<', '|', ',', '_', ']', ')', '"', '-', '~', ';', '>', '{', '&', '*', '=', '(', '[', '/', '%', '.'}


In [10]:
# export all documents from the dataset into an empty list 
documents = []

for doc_id in doc_ids:
    text = reuters.raw(doc_id)
    documents.append(text)

### Tokenization and creating corpus

We tokenize each document into individual words and convert them to lowercase. The `t.isalpha()` check then drops punctuation, numbers, and any other token that contains a non-letter character (such as abbreviations written with periods), and the `t not in stop_words` check drops stopwords. Punctuation does not carry semantic meaning for word embedding models and would introduce unnecessary noise if included. We also remove stopwords, such as “the”, “is”, and “and”, because these words occur very frequently but contribute little semantic information. Removing stopwords reduces noise and improves the quality of learned embeddings by focusing on content-bearing words. This preprocessing step transforms raw text into clean word sequences that can be used to generate context windows for training word embedding models. Document boundaries are preserved to ensure that context windows do not cross unrelated documents.

In [11]:
# build the corpus as a list of token lists, one per document
tokenized_docs = []

for doc in documents:
    tokens = word_tokenize(doc.lower())
    tokens = [
        t for t in tokens
        if t.isalpha() and t not in stop_words
    ]
    tokenized_docs.append(tokens)
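
To make the effect of the filter concrete, here is a toy sketch. It assumes plain whitespace splitting in place of `word_tokenize` and a tiny hand-picked stop-word set (`toy_stop_words`); both are stand-ins so the example runs on its own, not what the notebook uses.

```python
# Toy stand-ins: the notebook uses nltk's word_tokenize and the full
# English stop-word list; both are replaced here so the sketch runs alone.
toy_stop_words = {"the", "and", "on", "of", "to"}

def toy_preprocess(text):
    """Lowercase, split on whitespace, keep alphabetic non-stop-word tokens."""
    tokens = text.lower().split()
    return [t for t in tokens if t.isalpha() and t not in toy_stop_words]

sentence = "The U.S. imposed 300 mln dlrs of tariffs on imports"
cleaned = toy_preprocess(sentence)
print(cleaned)
```

Note that `u.s.` and `300` disappear along with the stop words, because `isalpha()` rejects any token that contains a non-letter character.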

In [12]:
len(tokenized_docs)

10788

In [13]:
tokenized_docs

[['asian',
  'exporters',
  'fear',
  'damage',
  'rift',
  'mounting',
  'trade',
  'friction',
  'japan',
  'raised',
  'fears',
  'among',
  'many',
  'asia',
  'exporting',
  'nations',
  'row',
  'could',
  'inflict',
  'economic',
  'damage',
  'businessmen',
  'officials',
  'said',
  'told',
  'reuter',
  'correspondents',
  'asian',
  'capitals',
  'move',
  'japan',
  'might',
  'boost',
  'protectionist',
  'sentiment',
  'lead',
  'curbs',
  'american',
  'imports',
  'products',
  'exporters',
  'said',
  'conflict',
  'would',
  'hurt',
  'tokyo',
  'loss',
  'might',
  'gain',
  'said',
  'impose',
  'mln',
  'dlrs',
  'tariffs',
  'imports',
  'japanese',
  'electronics',
  'goods',
  'april',
  'retaliation',
  'japan',
  'alleged',
  'failure',
  'stick',
  'pact',
  'sell',
  'semiconductors',
  'world',
  'markets',
  'cost',
  'unofficial',
  'japanese',
  'estimates',
  'put',
  'impact',
  'tariffs',
  'billion',
  'dlrs',
  'spokesmen',
  'major',
  'electronics

### Numericalization 

We now have a tokenized corpus, but models operate on numbers rather than words, so each word must be mapped to an integer index. In the lab code we simply assigned indices with `enumerate`, which is fine for a small corpus but less suitable for a large one such as news text. Here we count word frequencies, keep only the `MAX_VOCAB` most frequent words, and assign indices in order of decreasing frequency, so index 0 is the most frequent word. Every word outside this vocabulary is mapped to a single `<UNK>` token. This caps the vocabulary size and reduces the noise contributed by very rare words.

In [14]:
from collections import Counter 

word_counts = Counter()
for doc in tokenized_docs:
    word_counts.update(doc)

In [15]:
word_counts

Counter({'said': 25381,
         'mln': 18598,
         'vs': 14332,
         'dlrs': 12329,
         'pct': 9771,
         'lt': 8696,
         'cts': 8308,
         'net': 6986,
         'year': 6715,
         'billion': 5809,
         'loss': 5115,
         'would': 4688,
         'company': 4659,
         'shr': 4131,
         'inc': 3930,
         'bank': 3612,
         'corp': 3267,
         'last': 3227,
         'oil': 3193,
         'share': 3084,
         'trade': 3060,
         'profit': 2939,
         'market': 2789,
         'new': 2705,
         'qtr': 2672,
         'shares': 2652,
         'stock': 2625,
         'one': 2536,
         'also': 2532,
         'tonnes': 2509,
         'revs': 2312,
         'two': 2276,
         'sales': 2214,
         'prices': 2193,
         'group': 2111,
         'per': 2058,
         'may': 2057,
         'march': 2005,
         'april': 1973,
         'first': 1878,
         'rate': 1845,
         'japan': 1837,
         'price': 179

In [16]:
# cap the vocabulary at the 20,000 most frequent words
MAX_VOCAB = 20000
# most_common returns (word, count) pairs sorted by descending count
most_common = word_counts.most_common(MAX_VOCAB)
print(most_common)



In [17]:
# build word2index from the most common words (index 0 = most frequent)
word2index = {w:i for i,(w,c) in enumerate(most_common)}
# give <UNK> its own index, as in the starter code
word2index["<UNK>"] = len(word2index)

In [18]:
# get UNK 
UNK = word2index["<UNK>"]
# map the entire corpus to indices once, up front
# in the starter we did this on the fly with the prepare_sequence method,
# but repeating that lookup for a corpus of this size
# would slow training down. That's why we index the whole corpus just once
indexed_docs = []
for doc in tokenized_docs:
    indexed_docs.append([word2index.get(w, UNK) for w in doc])


In [19]:
# check the values
print("Vocab size:", len(word2index))
print("First doc tokens:", tokenized_docs[0][:20])
print("First doc idxs:", indexed_docs[0][:20])


Vocab size: 20001
First doc tokens: ['asian', 'exporters', 'fear', 'damage', 'rift', 'mounting', 'trade', 'friction', 'japan', 'raised', 'fears', 'among', 'many', 'asia', 'exporting', 'nations', 'row', 'could', 'inflict', 'economic']
First doc idxs: [1830, 574, 2166, 972, 11548, 3827, 20, 3215, 41, 402, 1533, 509, 408, 1707, 1726, 312, 1199, 54, 17749, 96]
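
The same frequency-ranked indexing can be traced by hand on a toy corpus. The documents and the cap `TOY_MAX_VOCAB` below are made up for illustration; only the mapping logic mirrors the cells above.

```python
from collections import Counter

# Hypothetical tokenized documents.
toy_docs = [["oil", "prices", "rose"], ["oil", "trade", "oil"], ["prices", "fell"]]

counts = Counter()
for doc in toy_docs:
    counts.update(doc)

TOY_MAX_VOCAB = 2  # hypothetical cap, far smaller than the notebook's 20,000
toy_most_common = counts.most_common(TOY_MAX_VOCAB)

# Index 0 goes to the most frequent word, index 1 to the next, and so on.
toy_word2index = {w: i for i, (w, c) in enumerate(toy_most_common)}
toy_word2index["<UNK>"] = len(toy_word2index)
unk = toy_word2index["<UNK>"]

# Words that did not make the cut all collapse onto the <UNK> index.
toy_indexed = [[toy_word2index.get(w, unk) for w in doc] for doc in toy_docs]
print(toy_word2index)
print(toy_indexed)
```

Here `oil` and `prices` keep their own indices, while `rose`, `trade`, and `fell` all share the `<UNK>` index.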


### 2. Preparing train data

In [20]:
# for doc in indexed_docs:
#     for i, target in enumerate(doc):
#         print(i, target)

In [21]:
def make_skip_grams(indexed_docs, window_size=2):
    """
    A function that create skip_grams from the inidexed list of documents
    with 2 default window size. 

    
    """
    skip_grams = []
    for doc in indexed_docs:
        for i, center in enumerate(doc): 
            start = max(0, i-window_size)
            end = min(len(doc), i+window_size+1)
            for j in range(start, end):
                if j!=i:
                    skip_grams.append((center, doc[j]))
    
    return skip_grams
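
As a quick check of the windowing logic, the sketch below applies the same loop to a single toy document. The helper name `toy_skip_grams` and the index values are illustrative only.

```python
def toy_skip_grams(doc, window_size=2):
    """Same windowing logic as make_skip_grams, for a single document."""
    pairs = []
    for i, center in enumerate(doc):
        start = max(0, i - window_size)
        end = min(len(doc), i + window_size + 1)
        pairs.extend((center, doc[j]) for j in range(start, end) if j != i)
    return pairs

# Four hypothetical word indices with a window of 1 on each side.
pairs = toy_skip_grams([10, 11, 12, 13], window_size=1)
print(pairs)
```

Words at the edges of the document get fewer context words because the window is clipped at the document boundary.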

This approach is fine if we have enough RAM. To be more memory-efficient, we can instead use a generator that yields one batch at a time, as follows.

def get_batches(docs, batch_size, window_size):
    inputs = []
    labels = []
    for doc in docs:
        for i, target in enumerate(doc):
            # 1. Define the window boundaries
            start = max(0, i - window_size)
            end = min(len(doc), i + window_size + 1)

            # 2. Get the context words (indices)
            # We exclude the target word itself (index i)
            context = [doc[j] for j in range(start, end) if j != i]

            # 3. Create (target, context_word) pairs
            for context_word in context:
                inputs.append(target)
                labels.append(context_word)

                # 4. If we have enough for a batch, "yield" it
                if len(inputs) == batch_size:
                    yield np.array(inputs), np.array(labels)
                    inputs, labels = [], []

    # 5. Yield the final, smaller batch so that no pairs are dropped
    if inputs:
        yield np.array(inputs), np.array(labels)

In [22]:
skip_grams = make_skip_grams(indexed_docs)

In [23]:
len(skip_grams)

3294886

3.3 million pairs is a lot, but it fits in our RAM here. If it did not, we should fall back on the generator approach above.
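
The number of pairs can also be worked out before building the list, which helps when judging whether it will fit in memory. The sketch below counts the pairs one document produces. The function name `count_pairs` and the document lengths are illustrative. Summing it over `len(doc)` for every document in `indexed_docs` should reproduce `len(skip_grams)`.

```python
def count_pairs(doc_len, window_size=2):
    """Number of (center, context) pairs one document of doc_len tokens yields."""
    # An offset d contributes one pair for each of the doc_len - d positions
    # that have a neighbor d steps away, counted in both directions.
    return 2 * sum(max(0, doc_len - d) for d in range(1, window_size + 1))

toy_lengths = [0, 1, 2, 5, 100]  # hypothetical document lengths
pair_counts = [count_pairs(n) for n in toy_lengths]
print(pair_counts)
```

For long documents this approaches `2 * window_size` pairs per token, so the pair list is roughly four times the size of the token stream with a window of 2.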

In [24]:
# sample a random mini-batch of skip-gram pairs (without replacement)
def random_batch(skip_grams, batch_size):
    random_index = np.random.choice(len(skip_grams), batch_size, replace=False)

    random_inputs = [[skip_grams[idx][0]] for idx in random_index]
    random_labels = [[skip_grams[idx][1]] for idx in random_index]

    return torch.LongTensor(random_inputs), torch.LongTensor(random_labels)
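
Drawing indices without replacement from a population of millions may be costly if the sampler shuffles the whole population for every batch. One possible alternative, which is an assumption and not what the notebook does, is Python's `random.sample`. It accepts a `range` directly and draws only the requested number of indices. The helper name `sample_pair_batch` and the toy pairs are hypothetical.

```python
import random

def sample_pair_batch(pairs, batch_size, rng=random):
    """Pick batch_size distinct pairs; returns (centers, contexts) as nested lists."""
    # Sampling from a range avoids building a list of all indices first.
    idxs = rng.sample(range(len(pairs)), batch_size)
    centers = [[pairs[i][0]] for i in idxs]
    contexts = [[pairs[i][1]] for i in idxs]
    return centers, contexts

toy_pairs = [(i, i + 1) for i in range(1000)]  # hypothetical (center, context) pairs
centers, contexts = sample_pair_batch(toy_pairs, batch_size=4, rng=random.Random(0))
print(centers, contexts)
```

The nested lists have the same `[batch_size, 1]` layout that `random_batch` passes to `torch.LongTensor`.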

In [25]:
# get a random input batch and output batch
input_batch, output_batch = random_batch(skip_grams, batch_size=2)

In [26]:
input_batch.shape

torch.Size([2, 1])

### 3. Model

1. Skip-grams without negative sampling

$$J(\theta) = -\frac{1}{T}\sum_{t=1}^{T}\sum_{\substack{-m \leq j \leq m \\ j \neq 0}}\log P(w_{t+j} | w_t; \theta)$$

where $P(w_{t+j} \mid w_t; \theta)$ is the softmax probability of observing context word $o$ given center word $c$:

$$P(o|c)=\frac{\exp(\mathbf{u_o^{\top}v_c})}{\sum_{w=1}^V\exp(\mathbf{u_w^{\top}v_c})}$$
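
Taking the logarithm of this softmax gives $\log P(o|c) = \mathbf{u_o^{\top}v_c} - \log\sum_{w}\exp(\mathbf{u_w^{\top}v_c})$, which is the quantity the loss averages. The pure-Python sketch below evaluates it on a made-up three-word vocabulary. The vectors are hypothetical, and subtracting the maximum score is a standard stability trick rather than something the notebook does.

```python
import math

def log_softmax_prob(u_vectors, v_c, o):
    """log P(o | c) = u_o . v_c - log(sum_w exp(u_w . v_c))."""
    scores = [sum(a * b for a, b in zip(u_w, v_c)) for u_w in u_vectors]
    # Subtract the max score before exponentiating so exp() cannot overflow.
    m = max(scores)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[o] - log_norm

# Hypothetical 3-word vocabulary with 2-dimensional embeddings.
u_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v_c = [2.0, 0.0]
log_probs = [log_softmax_prob(u_vectors, v_c, o) for o in range(3)]
print([round(math.exp(lp), 4) for lp in log_probs])
```

The normalizer sums over the whole vocabulary, which is why this variant becomes expensive with about 20,000 words and why negative sampling is attractive.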

In [27]:
class Skipgram(nn.Module):

    # init function
    def __init__(self, vocab_size, emb_size):
        super().__init__()
        self.embedding_v = nn.Embedding(vocab_size, emb_size)
        self.embedding_u = nn.Embedding(vocab_size, emb_size)
    
    def forward(self, center_words, target_words, all_vocabs):
        center_embeds = self.embedding_v(center_words)
        target_embeds = self.embedding_u(target_words)
        all_embeds      = self.embedding_u(all_vocabs)

        scores = target_embeds.bmm(center_embeds.transpose(1, 2)).squeeze(2)
        norm_scores = all_embeds.bmm(center_embeds.transpose(1, 2)).squeeze(2)
        nll = -torch.mean(torch.log(torch.exp(scores)/torch.sum(torch.exp(norm_scores), 1).unsqueeze(1)))        
        return nll 

In [28]:
batch_size     = 256 # mini-batch size
embedding_size = 100 
voc_size       = len(word2index) # vocabulary size
skip_model          = Skipgram(voc_size, embedding_size)

optimizer = optim.Adam(skip_model.parameters(), lr=0.001)

In [29]:
# train the model 
import time

num_epochs = 2000  # each "epoch" here is one randomly sampled mini-batch, not a full pass over the data

# 1. Create the master list ONCE (Uses almost no RAM)
vocab_idxs = torch.arange(voc_size) 

for epoch in range(num_epochs):
    start = time.time()
    input_batch, target_batch = random_batch(skip_grams, batch_size)

    # 2. Expand it INSIDE the loop (Uses almost zero RAM and is very fast)
    # This ensures it always matches the size of the current batch (even the last one)
    all_vocabs = vocab_idxs.expand(input_batch.size(0), voc_size)

    optimizer.zero_grad()
    loss = skip_model(input_batch, target_batch, all_vocabs)
    loss.backward()
    optimizer.step()

    if (epoch + 1) % 100 == 0:
        print(f"Epoch {epoch+1}/{num_epochs} | loss: {loss.item():.4f} | time: {time.time()-start:.2f}s")

Epoch 100/2000 | loss: 35.1864 | time: 0.91s
Epoch 200/2000 | loss: 35.0498 | time: 0.97s
Epoch 300/2000 | loss: 32.0791 | time: 0.87s
Epoch 400/2000 | loss: 30.1814 | time: 0.95s
Epoch 500/2000 | loss: 29.8022 | time: 2.35s
Epoch 600/2000 | loss: 29.7815 | time: 0.82s
Epoch 700/2000 | loss: 26.6725 | time: 0.93s
Epoch 800/2000 | loss: 26.3198 | time: 2.88s
Epoch 900/2000 | loss: 27.6003 | time: 2.30s
Epoch 1000/2000 | loss: 25.6601 | time: 0.98s
Epoch 1100/2000 | loss: 24.8357 | time: 0.98s
Epoch 1200/2000 | loss: 23.8909 | time: 1.05s
Epoch 1300/2000 | loss: 24.2046 | time: 0.91s
Epoch 1400/2000 | loss: 24.3271 | time: 1.20s
Epoch 1500/2000 | loss: 23.3277 | time: 1.68s
Epoch 1600/2000 | loss: 22.5264 | time: 1.94s
Epoch 1700/2000 | loss: 22.9271 | time: 1.63s
Epoch 1800/2000 | loss: 23.2933 | time: 1.13s
Epoch 1900/2000 | loss: 20.7216 | time: 1.03s
Epoch 2000/2000 | loss: 20.1706 | time: 1.02s


2. Skip-grams with negative sampling
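
For reference, the per-pair objective that the `SkipgramNEG` code below implements can be written as follows, where $\sigma$ is the sigmoid function and the $K$ negative words $w_k$ are drawn from a noise distribution $P_n(w)$. The notation follows the softmax formula above. This is a reading of the code, not a formula given in the original notebook.

$$J_{\text{NEG}} = -\log\sigma(\mathbf{u_o^{\top}v_c}) - \sum_{k=1}^{K}\log\sigma(-\mathbf{u_k^{\top}v_c}), \qquad w_k \sim P_n(w)$$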

In [30]:
# unigram distribution 
# create the distribution based on the order of indices in word2index
word_counts_array = np.array([word_counts.get(word, 0) for word, idx in sorted(word2index.items(), key=lambda x: x[1])])

# 1. Apply the 3/4 power rule
noise_dist = np.power(word_counts_array, 0.75)

# 2. Normalize so they sum to 1
probs = noise_dist / np.sum(noise_dist)

In [31]:
probs

array([9.41833349e-03, 7.45919910e-03, 6.13511371e-03, ...,
       4.68373718e-06, 4.68373718e-06, 0.00000000e+00])
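
The effect of the 3/4 power can be seen on a toy vocabulary. The counts below are invented; the sketch compares plain relative frequencies with the smoothed ones. The final `0.0` in the array above belongs to `<UNK>`, which has no count of its own and is therefore never drawn as a negative sample.

```python
toy_counts = {"said": 1000, "oil": 100, "rift": 1}  # hypothetical counts

def normalize(weights):
    """Scale the values of a dict so that they sum to 1."""
    total = sum(weights.values())
    return {w: x / total for w, x in weights.items()}

raw_probs = normalize(toy_counts)
smoothed_probs = normalize({w: c ** 0.75 for w, c in toy_counts.items()})

for w in toy_counts:
    print(f"{w:>5}: raw={raw_probs[w]:.4f}  smoothed={smoothed_probs[w]:.4f}")
```

Raising counts to the power 0.75 shifts probability mass from very frequent words toward rarer ones, so rare words are sampled as negatives more often than their raw frequency would suggest.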

In [32]:
class SkipgramNEG(nn.Module):
    def __init__(self, vocab_size, emb_size):
        super(SkipgramNEG, self).__init__()
        self.embedding_v = nn.Embedding(vocab_size, emb_size) # center word
        self.embedding_u = nn.Embedding(vocab_size, emb_size) # context words
        self.log_sigmoid = nn.LogSigmoid()

        # initialize the embedding weights to small values

        initrange = 0.5 / emb_size
        self.embedding_v.weight.data.uniform_(-initrange, initrange)
        self.embedding_u.weight.data.uniform_(-initrange, initrange)


    def forward(self, center_words, target_words, negative_words):
        # center_words:  [batch_size, 1]
        # target_words:  [batch_size, 1]
        # negative_words: [batch_size, num_neg]

        v = self.embedding_v(center_words) # [batch_size, 1, emb_size]
        u = self.embedding_u(target_words) # [batch_size, 1, emb_size]
        n = self.embedding_u(negative_words) # [batch_size, num_neg, emb_size]

        # Positive score: dot product of v and u
        # We use bmm and then squeeze to get [batch_size, 1]
        positive_score = u.bmm(v.transpose(1, 2)).squeeze(2) 

        # Negative score: dot product of v and all n
        # [batch_size, num_neg, emb_size] bmm [batch_size, emb_size, 1]
        negative_score = n.bmm(v.transpose(1, 2)).squeeze(2) 

        # Loss formula from the paper
        # loss = -self.log_sigmoid(positive_score).mean() - self.log_sigmoid(-negative_score).mean()
        pos_loss = self.log_sigmoid(positive_score)
        neg_loss = self.log_sigmoid(-negative_score).sum(1)

        loss = -(pos_loss.squeeze() + neg_loss).mean()
        
        return loss
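
A hand-computable case helps to sanity-check this loss. With embeddings initialized close to zero, every dot product is close to 0, so each of the `1 + num_neg` log-sigmoid terms is close to $\log 2$. The pure-Python sketch below (helper names are illustrative) evaluates the single-pair loss for all-zero vectors and five negatives.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def neg_sampling_loss(v_c, u_o, u_negs):
    """-log sigma(u_o . v_c) - sum_k log sigma(-u_k . v_c) for one pair."""
    pos = log_sigmoid(dot(u_o, v_c))
    neg = sum(log_sigmoid(-dot(u_k, v_c)) for u_k in u_negs)
    return -(pos + neg)

zero = [0.0, 0.0]
loss_at_init = neg_sampling_loss(zero, zero, [zero] * 5)
print(round(loss_at_init, 4))  # equals 6 * log(2)
```

This value of about 4.16 is in line with the loss of roughly 4.09 reported after the first 100 steps in the training log that follows.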

In [33]:
batch_size     = 256 
embedding_size = 100
num_epochs     = 2000
num_neg        = 5 # Number of negative samples per positive pair

skip_neg_model          = SkipgramNEG(voc_size, embedding_size)
optimizer      = optim.Adam(skip_neg_model.parameters(), lr=0.001)

for epoch in range(num_epochs):
    start = time.time()
    input_batch, target_batch = random_batch(skip_grams, batch_size)
    
    # Generate negative samples
    # We pick 'num_neg' words for every item in the batch
    neg_batch = np.random.choice(len(probs), size=(input_batch.size(0), num_neg), p=probs)
    neg_batch = torch.LongTensor(neg_batch)

    optimizer.zero_grad()
    # Now passing 3 arguments!
    loss = skip_neg_model(input_batch, target_batch, neg_batch)
    loss.backward()
    optimizer.step()

    if (epoch + 1) % 100 == 0:
        print(f"Epoch {epoch+1}/{num_epochs} | loss: {loss.item():.4f} | time: {time.time()-start:.2f}s")

Epoch 100/2000 | loss: 4.0864 | time: 0.24s
Epoch 200/2000 | loss: 3.6173 | time: 0.24s
Epoch 300/2000 | loss: 3.3241 | time: 0.24s
Epoch 400/2000 | loss: 3.0470 | time: 0.24s
Epoch 500/2000 | loss: 2.9071 | time: 0.24s
Epoch 600/2000 | loss: 2.7852 | time: 0.24s
Epoch 700/2000 | loss: 2.6887 | time: 0.24s
Epoch 800/2000 | loss: 2.7041 | time: 0.23s
Epoch 900/2000 | loss: 2.7031 | time: 0.23s
Epoch 1000/2000 | loss: 2.5624 | time: 0.23s
Epoch 1100/2000 | loss: 2.5791 | time: 0.23s
Epoch 1200/2000 | loss: 2.5909 | time: 0.23s
Epoch 1300/2000 | loss: 2.5664 | time: 0.23s
Epoch 1400/2000 | loss: 2.5348 | time: 0.24s
Epoch 1500/2000 | loss: 2.5213 | time: 0.23s
Epoch 1600/2000 | loss: 2.4684 | time: 0.24s
Epoch 1700/2000 | loss: 2.4120 | time: 0.23s
Epoch 1800/2000 | loss: 2.4150 | time: 0.25s
Epoch 1900/2000 | loss: 2.4837 | time: 0.24s
Epoch 2000/2000 | loss: 2.4955 | time: 0.24s


3. GloVe

In [34]:
from collections import Counter

# count how many times each pair appears
# This creates the X_ij values for the GloVe formula


co_occurrence_counts = Counter(skip_grams) 
# co_occurrence_counts is roughly: { (word_idx_1, word_idx_2): count, ... }

print(f"Number of unique pairs: {len(co_occurrence_counts)}")

Number of unique pairs: 1313850
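
To see what `Counter(skip_grams)` stores, the sketch below builds the pairs for one made-up document with a window of 1 and counts them. The document and window size are illustrative only.

```python
from collections import Counter

toy_doc = [0, 1, 0, 1, 2]  # hypothetical indexed document
window = 1

# Same (center, context) pairing rule as make_skip_grams, for one document.
toy_pairs = [
    (toy_doc[i], toy_doc[j])
    for i in range(len(toy_doc))
    for j in range(max(0, i - window), min(len(toy_doc), i + window + 1))
    if j != i
]
toy_cooc = Counter(toy_pairs)
print(toy_cooc)
```

Because the window is symmetric, the counts for `(i, j)` and `(j, i)` are equal. Pairs that never co-occur do not appear in the counter at all, so every stored count is at least 1.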


In [35]:
# 2. Prepare the batch data
# GloVe needs inputs: [center, context, count]
glove_data = []
for (u, v), count in co_occurrence_counts.items():
    glove_data.append([u, v, count])

glove_data = np.array(glove_data) # Shape: (Num_Pairs, 3)
glove_data = torch.LongTensor(glove_data)
print("GloVe data prepared!")

GloVe data prepared!


In [36]:
class GloVe(nn.Module):
    def __init__(self, vocab_size, emb_size):
        super(GloVe, self).__init__()
        self.embedding_v = nn.Embedding(vocab_size, emb_size) # Center
        self.embedding_u = nn.Embedding(vocab_size, emb_size) # Context
        
        self.v_bias = nn.Embedding(vocab_size, 1) # Bias for center
        self.u_bias = nn.Embedding(vocab_size, 1) # Bias for context
        
        # Initialize small weights (Same trick as Word2Vec!)
        initrange = 0.5 / emb_size
        self.embedding_v.weight.data.uniform_(-initrange, initrange)
        self.embedding_u.weight.data.uniform_(-initrange, initrange)
        self.v_bias.weight.data.uniform_(-initrange, initrange)
        self.u_bias.weight.data.uniform_(-initrange, initrange)

    def forward(self, center_words, target_words, co_occurrences, x_max=100, alpha=0.75):
        # center_words: [batch]
        # target_words: [batch]
        # co_occurrences: [batch] (The raw counts X_ij)
        
        v = self.embedding_v(center_words)
        u = self.embedding_u(target_words)
        b_v = self.v_bias(center_words).squeeze()
        b_u = self.u_bias(target_words).squeeze()
        
        # The Prediction: v . u + b_v + b_u
        # (batch, emb) . (batch, emb) -> (batch)
        prediction = (v * u).sum(dim=1) + b_v + b_u
        
        # The Target: log(X_ij)
        log_x = torch.log(co_occurrences + 1e-9) # add epsilon to avoid log(0)
        
        # The Weighting Function f(X_ij)
        # 1. If count < x_max: (count / x_max)^alpha
        # 2. If count >= x_max: 1.0
        weights = (co_occurrences / x_max).pow(alpha)
        weights = torch.clamp(weights, max=1.0)
        
        # Weighted MSE Loss
        loss = (weights * (prediction - log_x).pow(2)).mean()
        
        return loss
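
The weighting function and the per-pair loss can be checked by hand with a few counts. The sketch below restates them in pure Python with the same defaults as the `forward` method (`x_max=100`, `alpha=0.75`). The function names are illustrative.

```python
import math

def glove_weight(x, x_max=100, alpha=0.75):
    """Weighting f(x): (x / x_max) ** alpha below x_max, capped at 1.0."""
    return min((x / x_max) ** alpha, 1.0)

def glove_pair_loss(prediction, x, x_max=100, alpha=0.75):
    """Weighted squared error for one pair: f(x) * (prediction - log x) ** 2."""
    return glove_weight(x, x_max, alpha) * (prediction - math.log(x)) ** 2

for x in [1, 10, 100, 500]:
    print(x, round(glove_weight(x), 4))
```

A pair seen only once gets a weight of about 0.03. Since many pairs in a small corpus are likely to be this rare, that may partly explain why the GloVe loss values reported below are numerically so small.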

In [37]:
from torch.utils.data import DataLoader, TensorDataset

# Hyperparameters for GloVe
batch_size     = 512 # Can be larger because data is compressed
embedding_size = 100
num_epochs     = 100 
learning_rate  = 0.001

# 1. Create DataLoader for efficient batching
dataset = TensorDataset(glove_data[:, 0], glove_data[:, 1], glove_data[:, 2])
# Shuffle is important for GloVe!
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True) 

# 2. Initialize
glove_model = GloVe(len(word2index), embedding_size)
optimizer = optim.Adam(glove_model.parameters(), lr=learning_rate)

# 3. Train
print("Starting GloVe Training...")
for epoch in range(num_epochs):
    start = time.time()
    
    for i, (center_batch, target_batch, count_batch) in enumerate(dataloader):
        optimizer.zero_grad()
        
        # count_batch needs to be float for calculations
        loss = glove_model(center_batch, target_batch, count_batch.float())
        loss.backward()
        optimizer.step()
        

    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch+1}/{num_epochs} | Loss: {loss.item():.4f} | Time: {time.time()-start:.2f}s")

Starting GloVe Training...
Epoch 10/100 | Loss: 0.0066 | Time: 24.47s
Epoch 20/100 | Loss: 0.0065 | Time: 24.45s
Epoch 30/100 | Loss: 0.0017 | Time: 24.62s
Epoch 40/100 | Loss: 0.0030 | Time: 24.81s
Epoch 50/100 | Loss: 0.0028 | Time: 24.46s
Epoch 60/100 | Loss: 0.0054 | Time: 24.46s
Epoch 70/100 | Loss: 0.0035 | Time: 24.94s
Epoch 80/100 | Loss: 0.0029 | Time: 24.71s
Epoch 90/100 | Loss: 0.0038 | Time: 24.69s
Epoch 100/100 | Loss: 0.0038 | Time: 24.95s


## Task 2

In [71]:
def accuracy_test(model, word2index, file='word-test.v1.txt'):
    print(f"--- Evaluating Model ---")
    
    # 1. Detect Model Type
    if hasattr(model, 'embedding_v'): # Your Custom Model
        embeddings = model.embedding_v.weight.detach().cpu().numpy()
        index2word = {v: k for k, v in word2index.items()}
        def get_index(w): return word2index.get(w)
    else: # Gensim
        embeddings = model.vectors
        word2index = model.key_to_index
        index2word = model.index_to_key 
        def get_index(w): return word2index.get(w)
        
    # Normalize
    norms = (embeddings ** 2).sum(axis=1) ** (1 / 2)
    norms = norms.reshape(-1, 1)
    embeddings_norm = embeddings / norms
    
    sem_correct = 0; sem_total = 0
    syn_correct = 0; syn_total = 0
    
    # 2. Robust Lookup Helper (Case Insensitive)
    def get_vec_idx(word):
        if word in word2index: return word2index[word]
        if word.lower() in word2index: return word2index[word.lower()]
        return None

    with open(file, 'r') as f:
        current_section = None
        for line in f:
            if line.startswith(':'):
                # section header line, e.g. ": capital-common-countries"
                current_section = line.split()[1].strip()
                continue
            
            # evaluate only one semantic and one syntactic section
            if current_section not in ['capital-common-countries', 'gram7-past-tense']:
                continue
            
            words = line.split() 
            if len(words) != 4: continue
            a, b, c, expected = words
            
            # Get Indices
            a_idx = get_vec_idx(a)
            b_idx = get_vec_idx(b)
            c_idx = get_vec_idx(c)
            
            if a_idx is None or b_idx is None or c_idx is None:
                continue 
            
            # Math
            target_vec = embeddings_norm[b_idx] - embeddings_norm[a_idx] + embeddings_norm[c_idx]
            
            # Search
            scores = np.dot(embeddings_norm, target_vec)
            sorted_indices = np.argsort(scores)[::-1]
            
            prediction = None
            for idx in sorted_indices[:5]:
                pred_word = index2word[idx]
                if pred_word.lower() not in [a.lower(), b.lower(), c.lower()]:
                    prediction = pred_word
                    break
            
            # Check result
            if prediction and prediction.lower() == expected.lower():
                if current_section == 'capital-common-countries': sem_correct += 1
                elif current_section == 'gram7-past-tense': syn_correct += 1
                
            if current_section == 'capital-common-countries': sem_total += 1
            elif current_section == 'gram7-past-tense': syn_total += 1

    print(f"Semantic Acc: {sem_correct}/{sem_total} ({sem_correct/sem_total*100 if sem_total > 0 else 0:.2f}%)")
    print(f"Syntactic Acc: {syn_correct}/{syn_total} ({syn_correct/syn_total*100 if syn_total > 0 else 0:.2f}%)")
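
The vector-offset idea behind this test can be illustrated with hand-made vectors. The sketch below is a simplification: it uses raw two-dimensional toy vectors and plain cosine similarity instead of pre-normalized learned embeddings, and all names and values are hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def solve_analogy(vectors, a, b, c):
    """Return the word closest to b - a + c, skipping a, b and c themselves."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Hand-made vectors: axis 0 encodes the country, axis 1 whether the word is a capital.
toy_vectors = {
    "france": [1.0, 0.0], "paris": [1.0, 1.0],
    "japan":  [3.0, 0.0], "tokyo": [3.0, 1.0],
    "oil":    [0.0, -1.0],
}
answer = solve_analogy(toy_vectors, "france", "paris", "japan")
print(answer)
```

The analogy succeeds here only because the toy vectors were built so that the capital offset is identical for both countries. Learned embeddings need enough training data for such regularities to emerge.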

In [72]:
# compute accuracy for each model
accuracy_test(skip_model, word2index, file="word-test.v1.txt")

--- Evaluating Model ---
Semantic Acc: 0/272 (0.00%)
Syntactic Acc: 0/1024 (0.00%)


In [73]:
accuracy_test(skip_neg_model, word2index, file="word-test.v1.txt")

--- Evaluating Model ---
Semantic Acc: 3/272 (1.10%)
Syntactic Acc: 0/1024 (0.00%)


In [74]:
accuracy_test(glove_model, word2index, file="word-test.v1.txt")

--- Evaluating Model ---
Semantic Acc: 0/272 (0.00%)
Syntactic Acc: 4/1024 (0.39%)


In [65]:
import gensim.downloader as api

In [66]:
print("Downloading Gensim GloVe model (approx 128MB)...")
# use 'glove-wiki-gigaword-100' because our custom models also have dim=100
glove_gensim = api.load("glove-wiki-gigaword-100") 
print("Model Loaded successfully!")

Downloading Gensim GloVe model (approx 128MB)...
Model Loaded successfully!


In [75]:
accuracy_test(glove_gensim, word2index, file="word-test.v1.txt")

--- Evaluating Model ---
Semantic Acc: 475/506 (93.87%)
Syntactic Acc: 865/1560 (55.45%)


| Model | Window Size | Training Loss | Training Time | Syntactic Accuracy | Semantic Accuracy |
| :--- | :---: | :---: | :---: | :---: | :---: |
| **Skipgram** | 2 | 20.3656 | 43 min 18 sec | 0% | 0% |
| **Skipgram (NEG)** | 2 | 2.4099 | 8 min 21 sec | 0% | 1.10% |
| **GloVe** | 2 | 0.0045 | 40 min 51 sec | 0.39% | 0% |
| **GloVe (Gensim)** | - | - | - | 55.45% | 93.87% |

Semantic accuracy is at or near 0% for all three custom models (0%, 1.10%, and 0%) because they were trained on the Reuters corpus, which consists of financial news. This dataset is highly specialized, focusing on economic terminology, corporate earnings, and trade. It lacks the broad geographical and cultural diversity required to establish strong capital-country vector relationships (e.g., Paris to France) found in the capital-common-countries section. The two skip-gram models also saw only 2,000 mini-batches of 256 pairs, a fraction of the roughly 3.3 million available pairs, which likely limits them further. Finally, the Reuters corpus is far smaller than the multi-billion-word Wikipedia and Gigaword corpus used to train the Gensim GloVe model, which explains why the pre-trained model performs so much better on these abstract relationships.

Among our custom models, only GloVe achieved non-zero accuracy on the syntactic task, at 0.39%, while the pre-trained Gensim model reached about 55%. Syntactic relationships such as verb tense often depend on immediate local context, that is, on words appearing directly next to each other. A window size of 2 should therefore, in principle, favor such local grammatical patterns over broad semantic ones, although with accuracies this close to zero our results are too weak to confirm it. I also believe that the large performance gap between our models and the Gensim model is due to the sheer volume of data. Past-tense formation has many irregular forms (e.g., go $\rightarrow$ went vs. walk $\rightarrow$ walked). A small corpus like Reuters may not contain enough examples of irregular verbs for the model to generalize the past-tense pattern, whereas the much larger corpus behind the Gensim model contains far more examples of each verb form.

### Similarity Test