# **FS25 NLP Project 1: Word Embeddings/Recurrent Neural Networks**

Fabian Dubach

# **Introduction**

The task for this project was to answer commonsense questions using two different architectures: pre-trained word embeddings (word2vec, GloVe or fastText) with a classifier, and a 2-layer RNN (LSTM or GRU) with a classifier. The training runs also had to be tracked with Wandb (workspace URL: https://wandb.ai/fabian-dubach-hochschule-luzern/CommonsenseQA?nw=nwuserfabiandubach).

# **Setup**

Import all libraries needed to run the code.

In [None]:
import os
import time
import traceback
from datetime import datetime
from collections import Counter

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import gensim

from datasets import load_dataset
from huggingface_hub import hf_hub_download

from tqdm import tqdm, trange
import wandb

Set a random seed to ensure reproducibility.

_Info about the seed value: The field of natural language processing began in the 1940s, after World War II. At this time, people recognized the importance of translation from one language to another and hoped to create a machine that could do this sort of translation automatically._

In [None]:
SEED = 1940

np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

Download tokenizer files.

In [None]:
nltk.download('punkt_tab')
nltk.download('stopwords')

Load pre-trained FastText word embeddings (300 dimensions).

I first wanted to choose GloVe, because I had seen that it performs well on semantic similarity and analogical reasoning. Because the GloVe vectors I considered are uncased (lowercase only) and I wanted to keep the original casing, I chose FastText instead. I used 300 dimensions because they represent word meanings more completely than the smaller options (50 or 100 dimensions) while still being practical to use.

In [None]:
model_path = hf_hub_download(repo_id="facebook/fasttext-en-vectors", filename="model.bin")

In [None]:
fasttext_model = gensim.models.fasttext.load_facebook_model(model_path)
wv = fasttext_model.wv

Inspect the vector size and vocabulary size of the loaded embeddings.

In [None]:
print("Vector size:", wv.vector_size)
print("Vocab size:", len(wv.index_to_key))

Check if known and unknown words create vectors.

In [None]:
print(wv["Hello"])
print(wv["jwadAJKJDwljlkdajl"])
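FastText can return a vector even for a made-up word because it composes vectors from character n-grams. The following is a simplified sketch of the n-gram extraction only. The helper name `char_ngrams` is hypothetical, the defaults of 3 to 6 characters follow the fastText defaults, and the real implementation additionally hashes the n-grams into buckets.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the wrapped word
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


# Even an unseen word is broken into n-grams that may be shared with known words
print(char_ngrams("cat", min_n=3, max_n=3))
print(len(char_ngrams("jwadAJKJDwljlkdajl")))
```

The vector of an unknown word is then built from the vectors of its n-grams, which is why the second lookup above does not fail.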

For the project, we had to use the CommonsenseQA dataset, a multiple-choice question answering dataset that contains 12'247 different questions and was developed to benchmark machine understanding of everyday knowledge. For each question there are 5 answer choices, of which only one is correct. To be able to answer these questions, "commonsense" is needed. The dataset is available on HuggingFace: https://huggingface.co/datasets/tau/commonsense_qa.

I split the dataset into training, validation and test sets to allow for model development and evaluation. I used the last 1'000 examples from the training set for validation and the original validation set for testing, since the real test set has no answer keys.

In [None]:
train = load_dataset("tau/commonsense_qa", split="train[:-1000]")
valid = load_dataset("tau/commonsense_qa", split="train[-1000:]")
test = load_dataset("tau/commonsense_qa", split="validation")

print(len(train), len(valid), len(test))

Login for the experiment tracking.

In [None]:
wandb.login()

# **Data Exploration**

In this section I explore the dataset to understand its structure and patterns.

### 1. Explore dataset structure

In [None]:
print("\033[4m" + "Dataset Features" + "\033[0m")
for feature in train.features:
    print(feature)
print("\n" + "\033[4m" + "Example" + "\033[0m")
for feature in train.features:
    print(feature + ":", train[0][str(feature)])

### 2. Get general info about each dataset

In [None]:
def dataset_to_df(dataset):
    return pd.DataFrame(dataset)

train_df = dataset_to_df(train)
valid_df = dataset_to_df(valid)
test_df = dataset_to_df(test)

In [None]:
print("\033[4m" + "Train Info" + "\033[0m")
print(train_df.info())

In [None]:
print("\033[4m" + "Validation Info" + "\033[0m")
print(valid_df.info())

In [None]:
print("\033[4m" + "Test Info" + "\033[0m")
print(test_df.info())

### 3. Analyze question lengths

In [None]:
combined_df = pd.concat([train_df, valid_df, test_df], ignore_index=True)

combined_df['question_length'] = combined_df['question'].apply(len)
combined_df['question_word_count'] = combined_df['question'].apply(lambda x: len(x.split()))

print("\033[4m" + "Question length (characters)" + "\033[0m")
print(f"Min: {combined_df['question_length'].min()}")
print(f"Max: {combined_df['question_length'].max()}")
print(f"Mean: {combined_df['question_length'].mean():.2f}")
print(f"Median: {combined_df['question_length'].median()}")

print("\n\033[4m" + "Question Word Count" + "\033[0m")
print(f"Min: {combined_df['question_word_count'].min()}")
print(f"Max: {combined_df['question_word_count'].max()}")
print(f"Mean: {combined_df['question_word_count'].mean():.2f}")
print(f"Median: {combined_df['question_word_count'].median()}")

### 4. Analyze option lengths

In [None]:
def get_option_lengths(choices):
    return [len(text) for text in choices['text']]

def get_option_word_counts(choices):
    return [len(text.split()) for text in choices['text']]

combined_df['option_lengths'] = combined_df['choices'].apply(get_option_lengths)
combined_df['option_word_counts'] = combined_df['choices'].apply(get_option_word_counts)

# Flatten the lists for analysis
all_option_lengths = [length for lengths in combined_df['option_lengths'] for length in lengths]
all_option_word_counts = [count for counts in combined_df['option_word_counts'] for count in counts]

print("\033[4m" + "Option length (characters)" + "\033[0m")
print(f"Min: {min(all_option_lengths)}")
print(f"Max: {max(all_option_lengths)}")
print(f"Mean: {np.mean(all_option_lengths):.2f}")
print(f"Median: {np.median(all_option_lengths)}")

print("\033[4m" + "\nOption word count" + "\033[0m")
print(f"Min: {min(all_option_word_counts)}")
print(f"Max: {max(all_option_word_counts)}")
print(f"Mean: {np.mean(all_option_word_counts):.2f}")
print(f"Median: {np.median(all_option_word_counts)}")

### 5. Analyze answer distribution

In [None]:
def extract_answer_letter(example):
    return example['answerKey']

combined_df['answer_letter'] = combined_df.apply(extract_answer_letter, axis=1)

print("\033[4m" + "Answer Distribution" + "\033[0m")
print(combined_df['answer_letter'].value_counts(), "\n")
print(combined_df['answer_letter'].value_counts(normalize=True).mul(100).round(2).astype(str) + '%')
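The answer distribution also gives two reference points for judging model accuracy later: random guessing and always predicting the most frequent answer letter. A minimal sketch, using made-up labels instead of the real dataset (the function names are hypothetical):

```python
from collections import Counter


def chance_accuracy(num_choices):
    """Expected accuracy when guessing uniformly at random."""
    return 1.0 / num_choices


def majority_baseline_accuracy(answer_keys):
    """Accuracy obtained by always predicting the most frequent answer letter."""
    counts = Counter(answer_keys)
    _, most_common_count = counts.most_common(1)[0]
    return most_common_count / len(answer_keys)


toy_answers = ["A", "B", "B", "C", "D", "E", "B", "A"]  # made-up labels for illustration
print(chance_accuracy(5))
print(majority_baseline_accuracy(toy_answers))
```

With 5 choices per question, chance level is 20%. If the answer letters are roughly balanced, the majority baseline stays close to that value.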

### 6. Extract common question words/phrases

In [None]:
def get_common_words(text_series, top_n=20):
    stop_words = set(stopwords.words('english'))
    all_words = []
    
    for text in text_series:
        words = word_tokenize(text.lower())
        filtered_words = [word for word in words if word.isalnum() and word not in stop_words]
        all_words.extend(filtered_words)
    
    return Counter(all_words).most_common(top_n)

print("\033[4m" + "Common Words in Questions" + "\033[0m")
common_words = get_common_words(train_df['question'], 10)
for word, count in common_words:
    print(f"{word}: {count}")

### 7. Visualize question length distribution

In [None]:
plt.figure(figsize=(8, 4))
sns.histplot(combined_df['question_word_count'], bins=20, kde=True)
plt.title('Distribution of Question Word Count')
plt.xlabel('Number of Words')
plt.ylabel('Frequency')

### 8. Visualize answer distribution

In [None]:
plt.figure(figsize=(8, 4))
sns.countplot(x='answer_letter', data=combined_df, order=combined_df['answer_letter'].value_counts().index)
plt.title('Distribution of Answers')
plt.xlabel('Answer Option')
plt.ylabel('Count')

# **Preprocessing**

For the preprocessing I looked at the following points:

1. Tokenization
2. Lowercasing, stemming, lemmatizing, stopword/punctuation removal 
3. Removal of unknown/other words 
4. Format cleaning (e.g. html-extracted text) 
5. Truncation 
6. Feature selection 

Here are my decisions and justifications for using or not using the above listed preprocessing methods:

1. Tokenization is absolutely mandatory.
2. I chose not to lowercase the text, because casing can carry meaning. Stemming and lemmatizing are not needed, because FastText already captures semantic similarities between related word forms. Stopword and punctuation removal is generally not needed for an RNN, because the model can learn to ignore irrelevant words by itself.
3. Removal of unknown/other words is also not needed, because FastText builds vectors for out-of-vocabulary words from character n-grams.
4. Format cleaning is not needed, because the CommonsenseQA dataset doesn't include any markup text.
5. The longest question is only 376 characters long, so truncation is not needed.
6. Feature selection is also not needed. RNNs typically work with the full sequence rather than selected features.

In [None]:
def preprocessing(text):
    tokens = nltk.tokenize.word_tokenize(text)
    return tokens

The get_embedding function transforms any sentence into a fixed-length vector representation by averaging the word embeddings of its tokens. If no tokens are found, a zero vector is returned.

In [None]:
def get_embedding(sentence):
    tokens = preprocessing(sentence)
    word_vectors = []
    for token in tokens:
        try:
            word_vectors.append(fasttext_model.wv[token])
        except KeyError:
            # Skip tokens not in vocabulary
            continue
    
    # Return the mean of the word vectors
    if word_vectors:
        return np.mean(word_vectors, axis=0)
    else:
        return np.zeros(fasttext_model.vector_size)  # Return a zero vector if no tokens were found
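One property of averaging is worth keeping in mind: the result does not depend on word order. The sketch below illustrates this with a made-up 2-dimensional embedding table and plain Python lists. `mean_pool` and `toy_table` are hypothetical stand-ins for get_embedding and the FastText vectors.

```python
def mean_pool(tokens, embedding_table, dim):
    """Average the vectors of all known tokens; return a zero vector if there are none."""
    vectors = [embedding_table[token] for token in tokens if token in embedding_table]
    if not vectors:
        return [0.0] * dim
    # zip(*vectors) groups the values of each dimension together
    return [sum(component) / len(vectors) for component in zip(*vectors)]


toy_table = {"red": [1.0, 0.0], "apple": [0.0, 1.0]}  # made-up vectors

print(mean_pool(["red", "apple"], toy_table, dim=2))
print(mean_pool(["apple", "red"], toy_table, dim=2))  # same result, order is lost
print(mean_pool([], toy_table, dim=2))                # zero-vector fallback
```

This loss of order is one motivation for the RNN variant, which reads the tokens as a sequence.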

For the model to interpret the answers correctly, I converted the answer keys into numerical indices (0, 1, 2, 3, 4).

In [None]:
def answer_key_to_index(answer_key):
  return ord(answer_key) - ord("A")

The compute_embeddings function creates embeddings for all text fields of an example (the question and its choices).

In [None]:
def compute_embeddings(example):
    question_embeddings = get_embedding(example["question"])
    choice_embeddings = [get_embedding(choice) for choice in example["choices"]["text"]]
    
    example["question_emb"] = question_embeddings.tolist()
    example["choice_embs"] = [embedding.tolist() for embedding in choice_embeddings]
    return example

train = train.map(compute_embeddings)
valid = valid.map(compute_embeddings)

### **Word Embedding**

Implement a class which can convert a regular dataset into a PyTorch-compatible dataset.

In [None]:
class CommonsenseQADataset(Dataset):
    def __init__(self, dataset):
        self.data = dataset

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        example = self.data[idx]
        question_tensor = torch.tensor(example["question_emb"]).float()
        choices_tensor = torch.tensor(example["choice_embs"]).float()
        answer_index = answer_key_to_index(example["answerKey"])
        return question_tensor, choices_tensor, torch.tensor(answer_index).long()

### **RNN**

Implement a class that creates a PyTorch-compatible dataset for processing question-answer pairs through an RNN model. It tokenizes the text, combines each question with each possible answer (separated by a special token), converts the tokens to embedding vectors and returns these sequences along with their lengths and the correct answer index.

In [None]:
class CommonsenseQARNNDataset(Dataset):
    def __init__(self, hf_dataset, word_vectors, embedding_dim=300):
        self.data = hf_dataset
        self.wv = word_vectors
        self.embedding_dim = embedding_dim
        self.SEP_TOKEN = "<SEP>"
        # Sample the separator embedding once with a fixed generator, so that <SEP>
        # maps to the same vector in every sequence and in every dataset instance
        sep_generator = torch.Generator().manual_seed(0)
        self.sep_embedding = torch.randn(embedding_dim, generator=sep_generator) * 0.1
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        example = self.data[idx]
        
        # Tokenize question and choices
        question_tokens = preprocessing(example["question"])
        choice_tokens = [preprocessing(choice) for choice in example["choices"]["text"]]
        
        # Create sequences and lengths for each choice
        sequences = []
        lengths = []
        for choice in choice_tokens:
            # Combine question and choice
            full_sequence = question_tokens + [self.SEP_TOKEN] + choice
            
            # Convert to embeddings
            embeddings = []
            for token in full_sequence:
                if token == self.SEP_TOKEN:
                    # Use the fixed separator embedding
                    embeddings.append(self.sep_embedding)
                    continue
                try:
                    # Use pretrained embedding
                    embeddings.append(torch.tensor(self.wv[token]))
                except KeyError:
                    # For OOV words, use random embedding
                    embeddings.append(torch.randn(self.embedding_dim) * 0.1)
            
            # Convert to tensor
            sequences.append(torch.stack(embeddings))
            lengths.append(len(embeddings))
        
        # Convert answer to index
        answer = ord(example["answerKey"]) - ord("A")
        
        return sequences, torch.tensor(lengths), torch.tensor(answer)

This function prepares batches of sequence data for efficient RNN processing.

1. Collect and flatten all sequences, lengths and answers from the batch
2. Sort sequences by length in descending order (optimizing RNN computation)
3. Pad shorter sequences to match the longest one
4. Create a mapping index to help reconstruct the original batch organization
5. Return the padded sequences, their sorted lengths, reconstruction indices and answer labels

The sorting step is important for using packed sequences in RNNs, which improves efficiency by processing only valid parts of each sequence, while the indices allow the model to match processed sequences back to their original question-answer pairs.
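To make "processing only valid parts" concrete, the sketch below computes how many sequences are still active at each time step once the lengths are sorted in descending order. This corresponds to the `batch_sizes` field of a PyTorch `PackedSequence`. The helper `packed_batch_sizes` is a hypothetical plain-Python illustration, not part of the notebook.

```python
def packed_batch_sizes(sorted_lengths):
    """Number of sequences that still have a real token at each time step."""
    max_len = max(sorted_lengths)
    return [sum(1 for length in sorted_lengths if length > t) for t in range(max_len)]


lengths = [4, 2, 2]  # made-up sequence lengths, sorted in descending order
sizes = packed_batch_sizes(lengths)
print(sizes)
print(sum(sizes), "real steps instead of", len(lengths) * max(lengths), "padded steps")
```

The RNN only has to process the real steps, so the saving grows with the spread between short and long sequences in a batch.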

In [None]:
def rnn_collate_batch(batch):
    # Separate sequences, lengths, and answers
    all_sequences = []
    all_lengths = []
    all_answers = []
    
    batch_size = len(batch)
    num_choices = len(batch[0][0])  # Number of choices per example
    
    for sequences, lengths, answer in batch:
        all_sequences.extend(sequences)
        all_lengths.append(lengths)
        all_answers.append(answer)
    
    # Combine and sort lengths
    lengths_tensor = torch.cat(all_lengths)
    sorted_indices = torch.argsort(lengths_tensor, descending=True)
    
    # Reorder sequences and lengths based on sorted indices
    sorted_sequences = [all_sequences[i] for i in sorted_indices]
    sorted_lengths = lengths_tensor[sorted_indices]
    
    # Pad sequences
    padded_sequences = pad_sequence(sorted_sequences, batch_first=True)
    
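    # Create indices to reconstruct original batch order: the inverse of the sort
    # permutation maps each original (example, choice) slot to its row after sorting
    indices = torch.argsort(sorted_indices).view(batch_size, num_choices)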
    
    # Stack answers
    answers_tensor = torch.stack(all_answers)
    
    return padded_sequences, sorted_lengths, indices, answers_tensor
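A mapping index that restores the original order after sorting has to be the inverse of the sort permutation. The plain-Python sketch below shows the idea on made-up lengths. `sort_with_inverse` is a hypothetical helper, and in PyTorch the same inverse can be obtained by calling `torch.argsort` on the sorted indices.

```python
def sort_with_inverse(lengths):
    """Return the descending sort order and the inverse permutation that undoes it."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    inverse = [0] * len(order)
    for sorted_pos, original_pos in enumerate(order):
        # The item that was at original_pos now sits at sorted_pos
        inverse[original_pos] = sorted_pos
    return order, inverse


lengths = [3, 5, 2, 4]  # made-up lengths in original order
order, inverse = sort_with_inverse(lengths)
sorted_lengths = [lengths[i] for i in order]

# Looking up the sorted list through the inverse permutation restores the original order
restored = [sorted_lengths[inverse[k]] for k in range(len(lengths))]
print(order, inverse, restored)
```

Reshaping the inverse permutation to (batch size, number of choices) gives, for every question and choice, the row of the sorted batch that holds its sequence.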

# **Model**

### Word embeddings

This WordEmbeddingQAClassifier is a neural network model for multiple-choice question answering using pre-computed word embeddings.

1. Take already-embedded question and choice vectors as input
2. Expand the question embedding to pair with each answer choice
3. Concatenate each question-choice pair along the feature dimension
4. Process these concatenated vectors through a simple feed-forward network:
    - A hidden layer with ReLU activation and dropout for regularization
    - An output layer that produces a single score for each question-choice pair



The model essentially measures compatibility between questions and potential answers, with higher scores indicating better matches. During training, these scores are compared against the correct answer to tune the model's parameters.

In [None]:
class WordEmbeddingQAClassifier(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, dropout_rate):
        super(WordEmbeddingQAClassifier, self).__init__()

        self.embedding_dim = embedding_dim
        self.hidden_dim = hidden_dim

        # First layer: concatenated embedding dimension
        self.fc1 = nn.Linear(2 * embedding_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(dropout_rate)
        
        # Output layer for classification
        self.fc2 = nn.Linear(hidden_dim, 1)

    def forward(self, question, choices):
        # Expand question to match choices dimension
        question_expanded = question.unsqueeze(1).expand(-1, choices.size(1), -1)
        
        # Concatenate question and choice embeddings
        combined = torch.cat((question_expanded, choices), dim=2)

        # First layer with ReLU and Dropout
        x = self.fc1(combined)
        x = self.relu(x)
        x = self.dropout(x)
        
        # Final layer
        x = self.fc2(x)
        return x.squeeze(-1)
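The model returns one score per answer choice. During training these scores are turned into a loss with softmax cross-entropy (nn.CrossEntropyLoss in PyTorch). The sketch below reproduces that computation for a single question in plain Python. The function name and the example scores are made up for illustration.

```python
import math


def cross_entropy(scores, correct_index):
    """Softmax cross-entropy for one question: log-sum-exp of all scores minus the correct score."""
    max_score = max(scores)  # subtract the maximum for numerical stability
    log_sum_exp = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    return log_sum_exp - scores[correct_index]


uninformed = [0.0, 0.0, 0.0, 0.0, 0.0]  # all five choices scored equally
confident = [4.0, 0.0, 0.0, 0.0, 0.0]   # choice A clearly preferred

print(cross_entropy(uninformed, correct_index=0))  # equals log(5)
print(cross_entropy(confident, correct_index=0))   # lower loss: correct choice scored highest
print(cross_entropy(confident, correct_index=1))   # higher loss: wrong choice scored highest
```

A model that scores all five choices equally therefore starts at a loss of log(5), roughly 1.61, which is a useful reference when reading the training curves.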

### RNN

This QARNNModel class implements a neural network for question answering using an LSTM (Long Short-Term Memory) network. I chose the LSTM because I expect its memory cell structure to better preserve information across the combined question-answer sequences, especially when key contextual elements from earlier parts of the sequence need to be remembered to evaluate the answer choices properly.

1. Process sequences of word embeddings through a bidirectional LSTM with 2 layers
2. The LSTM handles variable-length sequences efficiently by using packed sequences
3. After processing through the LSTM, it extracts the final hidden states from both directions (forward and backward) and concatenates them
4. These concatenated hidden states are passed through a classification head (a simple feed-forward network)
5. The classification head outputs a score for each sequence
6. The scores are then rearranged to match each question with its multiple answer choices using the indices parameter

The bidirectional nature of the LSTM allows the model to incorporate context from both before and after each word in the sequence. The model combines the question and each potential answer into a single sequence, processes them through the LSTM, and outputs scores indicating which answer is most likely correct.

In [None]:
class QARNNModel(nn.Module):
    def __init__(self, embedding_dim, hidden_dim=128, num_choices=5, dropout_rate=0.2):
        super(QARNNModel, self).__init__()
        
        self.embedding_dim = embedding_dim
        self.hidden_dim = hidden_dim
        self.num_choices = num_choices
        
        # 2-layer bidirectional LSTM
        self.lstm = nn.LSTM(
            input_size=embedding_dim,
            hidden_size=hidden_dim,
            num_layers=2,
            batch_first=True,
            dropout=dropout_rate,
            bidirectional=True
        )
        
        # Classification head
        lstm_output_dim = hidden_dim * 2  # bidirectional = *2
        self.classifier = nn.Sequential(
            nn.Linear(lstm_output_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout_rate),
            nn.Linear(hidden_dim, 1)
        )
    
    def forward(self, padded_sequences, sequence_lengths, indices):
        """Process all sequences together and then reshape for classification"""
        batch_size = indices.size(0)
        
        # Pack padded sequences with enforce_sorted=False
        packed = pack_padded_sequence(
            padded_sequences, 
            sequence_lengths.cpu(), 
            batch_first=True,
            enforce_sorted=False
        )
        
        # Run through LSTM
        _, (hidden, _) = self.lstm(packed)
        
        # Get final hidden states from both directions
        final_hidden = torch.cat([hidden[-2], hidden[-1]], dim=1)
        
        # Process through classifier
        logits = self.classifier(final_hidden).squeeze(-1)
        
        # Rearrange back to original order: indices[b, c] is the row in logits
        # that belongs to choice c of question b
        all_logits = logits[indices]
        
        return all_logits
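The expression `hidden[-2]` and `hidden[-1]` relies on how PyTorch orders the final hidden states of a multi-layer bidirectional LSTM: the first dimension is laid out layer by layer, with the forward direction before the backward direction within each layer. The small index calculation below illustrates this layout (`final_state_index` is a hypothetical helper, not a PyTorch function).

```python
def final_state_index(layer, direction, num_directions=2):
    """Row in h_n for a given layer and direction (0 = forward, 1 = backward)."""
    return layer * num_directions + direction


num_layers = 2
num_directions = 2
total_rows = num_layers * num_directions

last_forward = final_state_index(num_layers - 1, direction=0)
last_backward = final_state_index(num_layers - 1, direction=1)

# Expressed as negative indices, these are the two rows concatenated in forward()
print(last_forward - total_rows, last_backward - total_rows)
```

For 2 layers and 2 directions, the last layer's forward and backward states sit in rows 2 and 3, which are rows -2 and -1 when counted from the end.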

# **Training**

Check if a CUDA-compatible GPU is available and set the device for tensor computations accordingly, otherwise use the CPU.

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

The save_checkpoint function saves model checkpoints during training by storing the model's state, the optimizer state, the current epoch and the best validation accuracy to a file path. When saving the best-performing model, it also logs this information to Weights & Biases (wandb) for tracking, with error handling in case the logging fails.

In [None]:
def save_checkpoint(model, optimizer, epoch, best_val_accuracy, save_path):
    checkpoint = {
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'best_val_accuracy': best_val_accuracy
    }
    torch.save(checkpoint, save_path)
    try:
        if 'best_model' in save_path:
            wandb.log({
                "best_model_checkpoint": {
                    "path": save_path,
                    "epoch": epoch,
                    "val_accuracy": best_val_accuracy
                }
            })
    except Exception as e:
        print(f"Error logging checkpoint to wandb: {e}")

### **Word Embedding Training**

Set hyperparameters for the word embedding model.

In [None]:
embedding_dim_word_embedding = 300 # FastText -> 300d
hidden_dim_word_embedding = 64
dropout_rate_word_embedding = 0.2
learning_rate_word_embedding = 1e-4
weight_decay_word_embedding = 1e-5
batch_size_word_embedding = 32
num_epochs_word_embedding = 100

Create a PyTorch-compatible dataset for the train and the validation dataset.

In [None]:
train_dataset_word_embedding = CommonsenseQADataset(train)
valid_dataset_word_embedding = CommonsenseQADataset(valid)

Create two PyTorch DataLoader objects to efficiently batch and load data during training.

In [None]:
train_loader = DataLoader(
    train_dataset_word_embedding, 
    batch_size=batch_size_word_embedding, 
    shuffle=True, 
    num_workers=4,
    pin_memory=True
)

valid_loader = DataLoader(
    valid_dataset_word_embedding, 
    batch_size=batch_size_word_embedding, 
    shuffle=False,  # No need to shuffle validation data
    num_workers=4,
    pin_memory=True
)

Initialize the word embedding model and move it to the appropriate computing device.

In [None]:
word_embedding_model = WordEmbeddingQAClassifier(embedding_dim=embedding_dim_word_embedding, hidden_dim=hidden_dim_word_embedding, dropout_rate=dropout_rate_word_embedding)
word_embedding_model = word_embedding_model.to(device)

print(word_embedding_model)
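As a sanity check on the printed architecture, the number of trainable parameters can be worked out by hand from the hyperparameters above. A linear layer has one weight per input-output pair plus one bias per output. The helper below is a hypothetical plain-Python sketch of that arithmetic.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


embedding_dim = 300
hidden_dim = 64

fc1 = linear_params(2 * embedding_dim, hidden_dim)  # question and choice vectors concatenated
fc2 = linear_params(hidden_dim, 1)                  # one score per question-choice pair

print(fc1, fc2, fc1 + fc2)
```

With these settings the classifier has 38,529 trainable parameters, since the FastText vectors themselves are precomputed and not trained.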

Set up the loss function and optimizer for training the word embedding model.

In [None]:
criterion_word_embedding = nn.CrossEntropyLoss()
optimizer_word_embedding = torch.optim.AdamW(word_embedding_model.parameters(), lr=learning_rate_word_embedding, weight_decay=weight_decay_word_embedding)

Initialize a wandb run to track the training of the word embedding model.

In [None]:
word_embedding_run = wandb.init(
  project="CommonsenseQA",
  name=f"word_embedding-{datetime.now().strftime('%Y-%m-%dT%H:%M:%S')}",
  config={
    "model": "word_embedding",
    "embedding_dim": embedding_dim_word_embedding,
    "hidden_dim": hidden_dim_word_embedding,
    "batch_size": batch_size_word_embedding,
    "epoch": num_epochs_word_embedding,
    "dropout_rate": dropout_rate_word_embedding,
    "weight_decay": weight_decay_word_embedding,
  },
  reinit=True,
)

The train_word_embedding() function is the training loop for the word embedding model. It handles training, validation, logging and model checkpointing.

In [None]:
def train_word_embedding(
    model, 
    train_loader, 
    valid_loader, 
    criterion, 
    optimizer, 
    num_epochs, 
    device, 
    checkpoint_dir='checkpoints', 
    save_interval=1,
):
    # Create checkpoint directory if it doesn't exist
    os.makedirs(checkpoint_dir, exist_ok=True)
    
    # Initialize best validation accuracy
    best_word_embedding_accuracy = 0
    start_epoch = 0
    
    # Training loop
    for epoch in (pbar := trange(start_epoch, num_epochs)):
        pbar.set_description(f"Epoch {epoch+1}/{num_epochs}")

        model.train()
        train_total_loss = 0.0
        train_correct = 0
        train_total = 0

        for question_batch, choices_batch, y_batch in train_loader:
            optimizer.zero_grad()

            question_batch = question_batch.to(device)
            choices_batch = choices_batch.to(device)
            y_batch = y_batch.to(device)

            # Forward pass
            outputs = model(question_batch, choices_batch)

            # Compute loss
            train_batch_loss = criterion(outputs, y_batch)
            train_total_loss += train_batch_loss.item()

            # Compute accuracy
            train_predictions = torch.argmax(outputs, dim=1)
            train_correct += (train_predictions == y_batch).sum().item()
            train_total += y_batch.size(0)

            # Backward pass
            train_batch_loss.backward()
            optimizer.step()

        # Calculate train statistics
        avg_train_loss = train_total_loss / len(train_loader)
        train_accuracy = train_correct / train_total

        # Evaluate
        model.eval()
        val_correct = 0
        val_total = 0
        val_total_loss = 0

        with torch.no_grad():
            for question_batch, choices_batch, y_batch in valid_loader:
                question_batch = question_batch.to(device)
                choices_batch = choices_batch.to(device)
                y_batch = y_batch.to(device)

                val_outputs = model(question_batch, choices_batch)

                # Calculate validation loss
                val_batch_loss = criterion(val_outputs, y_batch)
                val_total_loss += val_batch_loss.item()

                val_predictions = torch.argmax(val_outputs, dim=1)
                val_correct += (val_predictions == y_batch).sum().item()
                val_total += y_batch.size(0)

        # Calculate validation statistics
        avg_val_loss = val_total_loss / len(valid_loader)
        val_accuracy = val_correct / val_total
        
        pbar.set_postfix({
            "train_loss": avg_train_loss, 
            "train_acc": train_accuracy, 
            "val_acc": val_accuracy
        })

        # Log metrics to wandb
        wandb.log({
            "epoch": epoch,
            "train_loss": avg_train_loss,
            "train_accuracy": train_accuracy,
            "val_loss": avg_val_loss,
            "val_accuracy": val_accuracy,
            "learning_rate": optimizer.param_groups[0]['lr'],
        })

        # Save checkpoint periodically
        if (epoch + 1) % save_interval == 0:
            checkpoint_path = os.path.join(checkpoint_dir, f'checkpoint_epoch_{epoch+1}.pt')
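            # Pass by keyword: save_checkpoint expects best_val_accuracy before save_path
            save_checkpoint(model, optimizer, epoch+1, best_val_accuracy=best_word_embedding_accuracy, save_path=checkpoint_path)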

        # Save best model based on validation accuracy
        if val_accuracy > best_word_embedding_accuracy:
            best_word_embedding_accuracy = val_accuracy
            best_model_path = os.path.join(checkpoint_dir, 'best_model.pt')
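            # Pass by keyword: save_checkpoint expects best_val_accuracy before save_path
            save_checkpoint(model, optimizer, epoch+1, best_val_accuracy=best_word_embedding_accuracy, save_path=best_model_path)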

    # Final save
    final_model_path = os.path.join(checkpoint_dir, 'final_model.pt')
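    # Pass by keyword: save_checkpoint expects best_val_accuracy before save_path
    save_checkpoint(model, optimizer, num_epochs, best_val_accuracy=best_word_embedding_accuracy, save_path=final_model_path)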
    
    wandb.finish()
    
    return model, best_word_embedding_accuracy

Define checkpoint directory for the word embedding model.

In [None]:
checkpoint_dir_embedding_model = f"./checkpoints/embedding-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"
os.makedirs(checkpoint_dir_embedding_model, exist_ok=True)

Run the training for the word embedding model.

In [None]:
model, best_word_embedding_accuracy = train_word_embedding(
    model=word_embedding_model,
    train_loader=train_loader,
    valid_loader=valid_loader,
    criterion=criterion_word_embedding,
    optimizer=optimizer_word_embedding,
    num_epochs=num_epochs_word_embedding,
    device=device,
    checkpoint_dir=checkpoint_dir_embedding_model,
    save_interval=1  # Save a checkpoint every epoch
)

print(f"Word embedding training complete! Best validation accuracy: {best_word_embedding_accuracy:.4f}")

### **RNN Training**

Set hyperparameters for the RNN model.

In [None]:
embedding_dim_rnn = 300 # FastText -> 300d
hidden_dim_rnn = 128
dropout_rate_rnn = 0.2
learning_rate_rnn = 1e-4
weight_decay_rnn = 1e-5
batch_size_rnn = 128
num_epochs_rnn = 100

Create a PyTorch-compatible dataset for the train and the validation dataset.

In [None]:
train_rnn_dataset = CommonsenseQARNNDataset(train, wv, embedding_dim=embedding_dim_rnn)
valid_rnn_dataset = CommonsenseQARNNDataset(valid, wv, embedding_dim=embedding_dim_rnn)

Create two PyTorch DataLoader objects to efficiently batch and load data during training.

In [None]:
train_rnn_loader = DataLoader(
    train_rnn_dataset,
    batch_size=batch_size_rnn,
    shuffle=True,
    collate_fn=rnn_collate_batch,
    num_workers=4,
    pin_memory=True
)

valid_rnn_loader = DataLoader(
    valid_rnn_dataset,
    batch_size=batch_size_rnn,
    shuffle=False,
    collate_fn=rnn_collate_batch,
    num_workers=4,
    pin_memory=True
)

Initialize the RNN model and move it to the appropriate computing device.

In [None]:
rnn_model = QARNNModel(embedding_dim=embedding_dim_rnn, hidden_dim=hidden_dim_rnn, dropout_rate=dropout_rate_rnn)
rnn_model = rnn_model.to(device)

print(rnn_model)
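The size of the RNN model can also be estimated by hand. The sketch below assumes the parameter layout of PyTorch's nn.LSTM: four gates per direction, each with input weights, recurrent weights and two bias vectors. It also assumes that the second layer receives the concatenated forward and backward outputs of the first. The helper names are hypothetical.

```python
def lstm_direction_params(input_size, hidden_size):
    """Parameters of one LSTM layer in one direction: 4 gates with two weight matrices and two biases."""
    return 4 * (input_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)


def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


embedding_dim = 300
hidden_dim = 128

layer_1 = 2 * lstm_direction_params(embedding_dim, hidden_dim)   # forward + backward
layer_2 = 2 * lstm_direction_params(2 * hidden_dim, hidden_dim)  # input is the bidirectional output
classifier = linear_params(2 * hidden_dim, hidden_dim) + linear_params(hidden_dim, 1)

print(layer_1, layer_2, classifier, layer_1 + layer_2 + classifier)
```

Under these assumptions the RNN model has 868,609 trainable parameters, considerably more than the feed-forward classifier on averaged embeddings.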

Set up the loss function and optimizer for training the RNN model.

In [None]:
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(rnn_model.parameters(), lr=learning_rate_rnn, weight_decay=weight_decay_rnn)

Initialize a wandb run to track the training of the RNN model.

In [None]:
rnn_run = wandb.init(
  project="CommonsenseQA",
  name=f"rnn-{datetime.now().strftime('%Y-%m-%dT%H:%M:%S')}",
  config={
    "model": "rnn",
    "embedding_dim": embedding_dim_rnn,
    "hidden_dim": hidden_dim_rnn,
    "batch_size": batch_size_rnn,
    "epoch": num_epochs_rnn,
    "dropout_rate": dropout_rate_rnn,
    "weight_decay": weight_decay_rnn,
  },
  reinit=True,
)

The train_rnn_model() function is the training loop for the RNN model. It handles training, validation, logging and model checkpointing.

In [None]:
def train_rnn_model(model, criterion, optimizer, train_loader, valid_loader, num_epochs, device, checkpoints_path=None, log_wandb=True, gradient_clip_val=1.0):
    best_val_accuracy = 0.0
    training_start_time = time.time()
    
    for epoch in range(num_epochs):
        print(f"\nEpoch {epoch+1}/{num_epochs}")
        
        # Training phase
        model.train()
        train_loss = 0.0
        train_correct = 0
        train_total = 0
        
        with tqdm(train_loader, desc="Training") as progress_bar:
            for batch_data in progress_bar:
                # Unpack the batch data
                padded_sequences, sequence_lengths, indices, answers = batch_data
                
                # Move to device
                padded_sequences = padded_sequences.to(device)
                sequence_lengths = sequence_lengths.to(device)
                indices = indices.to(device)
                answers = answers.to(device)
                
                # Forward pass
                optimizer.zero_grad()
                outputs = model(padded_sequences, sequence_lengths, indices)
                loss = criterion(outputs, answers)
                
                # Backward pass
                loss.backward()
                if gradient_clip_val > 0:
                    nn.utils.clip_grad_norm_(model.parameters(), gradient_clip_val)
                optimizer.step()
                
                # Statistics
                train_loss += loss.item()
                _, predicted = torch.max(outputs, 1)
                train_total += answers.size(0)
                train_correct += (predicted == answers).sum().item()
                
                # Update progress bar
                progress_bar.set_postfix({
                    'loss': f"{loss.item():.4f}",
                    'acc': f"{train_correct/train_total:.4f}"
                })
        
        train_accuracy = train_correct / train_total
        avg_train_loss = train_loss / len(train_loader)
        
        # Validation phase
        model.eval()
        val_loss = 0.0
        val_correct = 0
        val_total = 0
        
        with torch.no_grad():
            for batch_data in tqdm(valid_loader, desc="Validation"):
                # Unpack the batch data
                padded_sequences, sequence_lengths, indices, answers = batch_data
                
                # Move to device
                padded_sequences = padded_sequences.to(device)
                sequence_lengths = sequence_lengths.to(device)
                indices = indices.to(device)
                answers = answers.to(device)
                
                # Forward pass
                outputs = model(padded_sequences, sequence_lengths, indices)
                loss = criterion(outputs, answers)
                
                # Statistics
                val_loss += loss.item()
                _, predicted = torch.max(outputs, 1)
                val_total += answers.size(0)
                val_correct += (predicted == answers).sum().item()
        
        val_accuracy = val_correct / val_total
        avg_val_loss = val_loss / len(valid_loader)
        
        # Track the best validation accuracy and save the corresponding model
        if val_accuracy > best_val_accuracy:
            print(f"Validation accuracy improved from {best_val_accuracy:.4f} to {val_accuracy:.4f}")
            best_val_accuracy = val_accuracy
            # checkpoints_path defaults to None, so only save when a directory was given
            if checkpoints_path is not None:
                best_model_path = os.path.join(checkpoints_path, 'best_rnn_model.pt')
                save_checkpoint(model, optimizer, epoch+1, best_val_accuracy, best_model_path)
        
        # Print metrics
        print(f"Train Loss: {avg_train_loss:.4f}, Train Acc: {train_accuracy:.4f}")
        print(f"Val Loss: {avg_val_loss:.4f}, Val Acc: {val_accuracy:.4f}")
        
        # Log to wandb
        if log_wandb:
            wandb.log({
                "epoch": epoch + 1,  # 1-based, consistent with the printed epoch and the checkpoint
                "train_loss": avg_train_loss,
                "train_accuracy": train_accuracy,
                "val_loss": avg_val_loss,
                "val_accuracy": val_accuracy,
                "learning_rate": optimizer.param_groups[0]['lr']
            })
    
    print(f"Training completed in {(time.time() - training_start_time)/60:.2f} minutes")
    print(f"Best validation accuracy: {best_val_accuracy:.4f}")
    
    # Final metrics for wandb
    if log_wandb:
        wandb.run.summary["best_val_accuracy"] = best_val_accuracy
        wandb.finish()
    
    return model, best_val_accuracy
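The gradient clipping step inside the training loop can be illustrated in isolation. The sketch below uses a small, hypothetical linear layer and an artificially large loss (neither is part of the project) to show that `nn.utils.clip_grad_norm_` rescales all gradients together, so that their combined L2 norm does not exceed the chosen threshold.

```python
import torch
import torch.nn as nn

# Hypothetical tiny layer, used only to produce some gradients
clip_demo_layer = nn.Linear(4, 2)

# Scale the output by 100 so the gradients are deliberately large
clip_demo_loss = (clip_demo_layer(torch.ones(8, 4)) * 100.0).sum()
clip_demo_loss.backward()

clip_demo_max_norm = 1.0

# clip_grad_norm_ modifies the gradients in place and returns the total norm BEFORE clipping
norm_before_demo = nn.utils.clip_grad_norm_(clip_demo_layer.parameters(), clip_demo_max_norm)

# Recompute the total L2 norm over all parameter gradients after clipping
norm_after_demo = torch.sqrt(
    sum(p.grad.pow(2).sum() for p in clip_demo_layer.parameters())
)

print(f"norm before: {norm_before_demo.item():.2f}, norm after: {norm_after_demo.item():.4f}")
```

Clipping by the global norm keeps the direction of the update and only shrinks its size, which is the usual reason it is preferred over clipping each value separately for recurrent networks.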

Define checkpoint directory for the RNN.

In [None]:
checkpoint_dir_rnn = f"./checkpoints/rnn-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"
os.makedirs(checkpoint_dir_rnn, exist_ok=True)

Run the training for the RNN model.

In [None]:
trained_rnn_model, best_rnn_accuracy = train_rnn_model(
    model=rnn_model,
    criterion=criterion,
    optimizer=optimizer,
    train_loader=train_rnn_loader,
    valid_loader=valid_rnn_loader,
    num_epochs=num_epochs_rnn,
    device=device,
    checkpoints_path=checkpoint_dir_rnn,
    log_wandb=True,
    gradient_clip_val=rnn_run.config.gradient_clipping
)

print(f"RNN training complete! Best validation accuracy: {best_rnn_accuracy:.4f}")

# **Evaluation**

**run1**
batch_size = 32
embedding_dim = 300
hidden_dim = 64
dropout_rate = 0.2

# **Interpretation**