# NLP.F2501 Course Project 1 (Word embeddings and RNNs)

Nevin Helfenstein

# Introduction

In this notebook, I present my solution to the CommonsenseQA task. I'll train a model using word embeddings, RNNs, and other NLP techniques to achieve the best possible performance.

## Dataset Description

The CommonsenseQA dataset [(Talmor et al., 2019)](https://aclanthology.org/N19-1421/) contains 12,247 multiple-choice questions specifically designed to test commonsense reasoning. Unlike standard QA tasks, these questions require prior knowledge about how concepts relate in the real world.

Questions were created by extracting related concepts from ConceptNet and having crowd-workers author questions that require distinguishing between them. This methodology produced challenging questions that often cannot be answered through simple pattern matching.

The best baseline in the original paper (BERT-large) achieved only 56% accuracy compared to human performance of 89%, showing the difficulty of encoding human-like commonsense reasoning.

# Setup

First, we import all the required libraries

In [1]:
import torch
import nltk
import wandb
import logging
import re

import gensim

import pandas as pd
import seaborn as sns
from wordcloud import WordCloud

import torch.nn as nn
import numpy as np

import matplotlib.pyplot as plt
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from torch.utils.data import Dataset, DataLoader
from matplotlib_venn import venn2


from datasets import load_dataset
from huggingface_hub import hf_hub_download

from collections import Counter
from datetime import datetime




### Weights & Biases (W&B)

We set up the W&B configuration for later use

In [None]:
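# TODO: Correct the configuration for the new wandb project
# WandbLogger is not imported above; it is assumed here to be the PyTorch Lightning logger.
from pytorch_lightning.loggers import WandbLogger

wandb.login(key="")
wandb_logger = WandbLogger(project="experiment-tracking")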

def init_wandb_run(project_name, run_name, config_dict):
    run = wandb.init(
        project=project_name,
        name=run_name,
        config=config_dict,
    )
    return run


#### Example Run
experiment_1_config = {
    "learning_rate": 0.001,
    "epochs": 10,
    "batch_size": 32,
    "optimizer": "adam"
}
run1 = init_wandb_run("experiment-tracking", "experiment_1", experiment_1_config)


### Fixed variables

We set the random seed for all relevant libraries to ensure reproducibility

In [2]:
SEED = 42

np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

### Tokenizer and embedding model

I've selected FastText embeddings with the Common Crawl model (crawl-300d-2M-subword) for this project for the following reasons:

* **Subword modeling** - handles unknown words & typos
* **Morphologically aware** - recognizes word relationships 
* **Massive training corpus** - 600B tokens from Common Crawl
* **Rich embeddings** - 300 dimensions, 2M word vectors
* **Proven performance** - a widely used pretrained baseline for downstream NLP tasks
* **Well documented** - well known in the NLP community, with extensive documentation
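
To make the subword point concrete, the following is a minimal sketch of how FastText-style character n-grams can be extracted from a word. The helper name `char_ngrams` is hypothetical, and the n-gram range of 3 to 6 is an assumption (the pretrained model may use a different range). The real model additionally hashes each n-gram into a fixed number of buckets and combines the corresponding vectors, which this sketch leaves out.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, FastText-style.

    The word is wrapped in '<' and '>' so that prefixes and suffixes
    can be told apart from n-grams inside the word.
    """
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


# A misspelling still shares many n-grams with the correct spelling,
# which is why its vector can end up close to the original word.
correct = set(char_ngrams("banana"))
typo = set(char_ngrams("bananna"))
shared = correct & typo
print(sorted(shared)[:5])
print(f"{len(shared)} of {len(correct)} n-grams shared")
```

Because a vector can be assembled from these pieces, a word that never appeared in the training corpus still receives a non-trivial embedding.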

Download tokenizer files

In [3]:
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

Download the FastText model from Hugging Face (Facebook's common crawl model)

In [4]:
#model_path = hf_hub_download(repo_id="facebook/fasttext-en-vectors", filename="model.bin")

model_path = hf_hub_download(
    repo_id="facebook/fasttext-en-vectors", 
    filename="model.bin",
    proxies=None,
    resume_download=True,
    etag_timeout=900,
    local_files_only=False,
    token=None
)


Load the model

In [None]:
# load_facebook_vectors returns the keyed vectors directly, so there is no `.wv` attribute to access
ft_model = gensim.models.fasttext.load_facebook_vectors(model_path)
wv = ft_model

Create a function to get embeddings for words (returns the word vector)

In [None]:
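def get_fasttext_embedding(word):
    """Return the FastText vector for `word`, or a zero vector if none can be built."""
    try:
        return ft_model[word]
    except KeyError:
        # Match the dimensionality and dtype of the loaded vectors
        return np.zeros(ft_model.vector_size, dtype=np.float32)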

### Data Splits

The data is available on Hugging Face: [Data](https://huggingface.co/datasets/tau/commonsense_qa).
Only the train and validation splits have an answer key, so we define our own dataset splits.
We use all training samples except the last 1,000 as the train set and hold out those last 1,000 as the validation set. The original validation set serves as the test set.

In [None]:
train = load_dataset("tau/commonsense_qa", split="train[:-1000]")
valid = load_dataset("tau/commonsense_qa", split="train[-1000:]")
test = load_dataset("tau/commonsense_qa", split="validation")

print(len(train), len(valid), len(test))
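
The slicing above (`train[:-1000]` and `train[-1000:]`) can be sanity-checked with plain Python lists. This is a minimal sketch: `tail_split` is a hypothetical helper, and 9741 is the size of the original train split.

```python
def tail_split(n_examples, n_valid):
    """Split indices 0..n_examples-1 into a train part and a validation tail."""
    indices = list(range(n_examples))
    return indices[:-n_valid], indices[-n_valid:]


# 9741 examples in the original train split, 1000 held out for validation.
train_idx, valid_idx = tail_split(9741, 1000)
print(len(train_idx), len(valid_idx))

# No example may land in both parts.
assert set(train_idx).isdisjoint(valid_idx)
```

Note that taking the tail of the split is only a fair hold-out if the examples are not ordered in any meaningful way; otherwise a shuffled split would be the safer choice.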

## Data exploration

We convert to DataFrames for easier analysis

In [None]:
train_df = pd.DataFrame(train)
valid_df = pd.DataFrame(valid)
test_df = pd.DataFrame(test)

We check basic statistics such as the average question and choice length for each DataFrame

In [None]:
def analyze_dataset(df, name):
    """Analyze basic statistics of a dataset split"""
    print(f"=== {name} Set Statistics ===")
    print(f"Number of examples: {len(df)}")
    
    # Question statistics
    df['question_tokens'] = df['question'].apply(word_tokenize)
    df['question_length'] = df['question_tokens'].apply(len)
    
    # Answer choices statistics
    # In the Hugging Face schema, `choices` is a dict of parallel lists: {'label': [...], 'text': [...]}
    df['choices_length'] = df['choices'].apply(lambda x: [len(word_tokenize(text)) for text in x['text']])
    df['avg_choice_length'] = df['choices_length'].apply(np.mean)
    
    print(f"Average question length: {df['question_length'].mean():.2f} tokens")
    print(f"Average answer choice length: {df['avg_choice_length'].mean():.2f} tokens")
    print(f"Min/Max question length: {df['question_length'].min()}/{df['question_length'].max()} tokens")
    
    # Find the position of the correct answer (-1 if the answer key is missing)
    df['correct_answer_idx'] = df.apply(lambda row: next((i for i, label in enumerate(row['choices']['label'])
                                                         if label == row['answerKey']), -1), axis=1)
    
    return df

train_df = analyze_dataset(train_df, "Training")
valid_df = analyze_dataset(valid_df, "Validation")
test_df = analyze_dataset(test_df, "Test")
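
The statistics above depend on how each record stores its answer options. As a minimal sketch, assuming the schema of the Hugging Face release, where `choices` is a dictionary of two parallel lists (`label` and `text`), the index of the correct answer can be looked up as follows. The sample record is made up for illustration, and `correct_answer_index` is a hypothetical helper.

```python
# Made-up record in the assumed schema: `choices` holds two parallel lists.
example = {
    "question": "Where would you store milk to keep it cold?",
    "choices": {
        "label": ["A", "B", "C", "D", "E"],
        "text": ["cupboard", "refrigerator", "oven", "garage", "shelf"],
    },
    "answerKey": "B",
}


def correct_answer_index(record):
    """Return the position of the answer key among the labels, or -1 if absent."""
    labels = list(record["choices"]["label"])
    key = record["answerKey"]
    return labels.index(key) if key in labels else -1


idx = correct_answer_index(example)
print(idx, example["choices"]["text"][idx])
```

The `-1` fallback matters for records without an answer key, such as those in the unlabeled official test split.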

We plot the question length distribution

In [None]:
plt.figure(figsize=(12, 6))

sns.histplot(data=train_df, x='question_length', kde=True, label='Train', alpha=0.6)
sns.histplot(data=valid_df, x='question_length', kde=True, label='Validation', alpha=0.6)
sns.histplot(data=test_df, x='question_length', kde=True, label='Test', alpha=0.6)

plt.title('Distribution of Question Lengths')
plt.xlabel('Number of tokens in question')
plt.ylabel('Count')
plt.legend()
plt.savefig('question_length_distribution.png')
plt.close()

We plot the answer length distribution

In [None]:
plt.figure(figsize=(12, 6))

sns.histplot(data=train_df, x='avg_choice_length', kde=True, label='Train', alpha=0.6)
sns.histplot(data=valid_df, x='avg_choice_length', kde=True, label='Validation', alpha=0.6)
sns.histplot(data=test_df, x='avg_choice_length', kde=True, label='Test', alpha=0.6)

plt.title('Distribution of Answer Choice Lengths')
plt.xlabel('Average number of tokens in answer choices')
plt.ylabel('Count')
plt.legend()
plt.savefig('answer_length_distribution.png')
plt.close()

Count correct answer keys

In [None]:

train_pos_counts = Counter(train_df['correct_answer_idx'])
valid_pos_counts = Counter(valid_df['correct_answer_idx'])
test_pos_counts = Counter(test_df['correct_answer_idx'])

Convert that to percentage

In [None]:
train_pos_percent = {k: v/len(train_df)*100 for k, v in train_pos_counts.items()}
valid_pos_percent = {k: v/len(valid_df)*100 for k, v in valid_pos_counts.items()}
test_pos_percent = {k: v/len(test_df)*100 for k, v in test_pos_counts.items()}

Create a DataFrame again for better plotting

In [None]:
pos_df = pd.DataFrame({
    'Train': [train_pos_percent.get(i, 0) for i in range(5)],
    'Validation': [valid_pos_percent.get(i, 0) for i in range(5)],
    'Test': [test_pos_percent.get(i, 0) for i in range(5)]
}, index=['A', 'B', 'C', 'D', 'E'])

pos_df.plot(kind='bar', figsize=(10, 6))
plt.title('Distribution of Correct Answer Positions')
plt.xlabel('Answer Position')
plt.ylabel('Percentage (%)')
plt.legend()
plt.savefig('answer_position_distribution.png')
plt.close()

Create word clouds for questions

In [None]:
def create_wordcloud(texts, title, filename):
    """Create and save wordcloud from a list of texts"""
    plt.figure(figsize=(12, 8))
    
    all_text = ' '.join(texts)
    
    wordcloud = WordCloud(width=800, height=400, 
                         background_color='white',
                         max_words=100, 
                         contour_width=3).generate(all_text)
    
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title(title, fontsize=16)
    plt.tight_layout()
    plt.savefig(filename)
    plt.close()

create_wordcloud(train_df['question'].tolist(), 'Word Cloud - Training Questions', 'train_question_wordcloud.png')

Analyze question types (what, how, why, etc.)

In [None]:
def get_question_type(question):
    """Extract the question word from a question"""
    question = question.lower().strip()
    question_words = ['what', 'which', 'who', 'how', 'why', 'when', 'where']
    
    for word in question_words:
        if question.startswith(word) or f" {word} " in question:
            return word
    
    return 'other'

train_df['question_type'] = train_df['question'].apply(get_question_type)
valid_df['question_type'] = valid_df['question'].apply(get_question_type)
test_df['question_type'] = test_df['question'].apply(get_question_type)

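
The heuristic above returns the first match in the order of `question_words`, not the first question word that occurs in the sentence. The sketch below repeats the function so that it runs on its own and shows the effect on two made-up questions.

```python
def get_question_type(question):
    """Same heuristic as above, repeated so the example is self-contained."""
    question = question.lower().strip()
    question_words = ['what', 'which', 'who', 'how', 'why', 'when', 'where']
    for word in question_words:
        if question.startswith(word) or f" {word} " in question:
            return word
    return 'other'


# 'what' is checked before 'where', so it wins although 'where' comes first.
mixed = get_question_type("Where do you go to find out what time it is?")
# A sentence without any question word falls through to 'other'.
plain = get_question_type("The man needed flooring, so he went to a?")
print(mixed, plain)
```

This should be kept in mind when reading the distribution plot: the counts for words early in the list may be inflated at the expense of later ones.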

Count question types

In [None]:
train_type_counts = Counter(train_df['question_type'])
valid_type_counts = Counter(valid_df['question_type'])
test_type_counts = Counter(test_df['question_type'])

Convert to percentage

In [None]:
train_type_percent = {k: v/len(train_df)*100 for k, v in train_type_counts.items()}
valid_type_percent = {k: v/len(valid_df)*100 for k, v in valid_type_counts.items()}
test_type_percent = {k: v/len(test_df)*100 for k, v in test_type_counts.items()}

Plot the question type distribution

In [None]:
# Union of all question types seen in any split, sorted for a stable plotting order
all_types = sorted(set(train_type_counts) | set(valid_type_counts) | set(test_type_counts))
type_df = pd.DataFrame({
    'Train': [train_type_percent.get(t, 0) for t in all_types],
    'Validation': [valid_type_percent.get(t, 0) for t in all_types],
    'Test': [test_type_percent.get(t, 0) for t in all_types]
}, index=all_types)

type_df.plot(kind='bar', figsize=(12, 6))
plt.title('Distribution of Question Types')
plt.xlabel('Question Type')
plt.ylabel('Percentage (%)')
plt.legend()
plt.savefig('question_type_distribution.png')
plt.close()

POS (Part-of-Speech) tag analysis for questions

In [None]:
def analyze_pos_tags(texts, n=10):
    """Analyze the most common POS tags in a list of texts"""
    all_pos = []
    for text in texts:
        tokens = word_tokenize(text)
        tags = pos_tag(tokens)
        all_pos.extend([tag for _, tag in tags])
    
    return Counter(all_pos).most_common(n)

# Get most common POS tags in train questions
train_pos = analyze_pos_tags(train_df['question'].tolist())

plt.figure(figsize=(12, 6))
pos_df = pd.DataFrame(train_pos, columns=['POS Tag', 'Count'])
sns.barplot(x='POS Tag', y='Count', data=pos_df)
plt.title('Most Common POS Tags in Training Questions')
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('pos_tag_distribution.png')
plt.close()

Analyze vocabulary overlap between train and test

In [None]:
def get_vocab(texts):
    """Get vocabulary from a list of texts"""
    vocab = set()
    for text in texts:
        tokens = word_tokenize(text.lower())
        vocab.update(tokens)
    return vocab

train_vocab = get_vocab(train_df['question'].tolist())
test_vocab = get_vocab(test_df['question'].tolist())

Calculate the overlap and plot a Venn diagram

In [None]:
overlap = len(train_vocab.intersection(test_vocab))
train_only = len(train_vocab - test_vocab)
test_only = len(test_vocab - train_vocab)

plt.figure(figsize=(10, 6))

venn2(subsets=(train_only, test_only, overlap), 
      set_labels=('Train Vocabulary', 'Test Vocabulary'))
plt.title('Vocabulary Overlap Between Train and Test Sets')
plt.savefig('vocab_overlap.png')
plt.close()

print(f"Train vocabulary size: {len(train_vocab)}")
print(f"Test vocabulary size: {len(test_vocab)}")
print(f"Vocabulary overlap: {overlap} words ({overlap/len(train_vocab)*100:.2f}% of train vocab)")
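
The percentage printed above is relative to the training vocabulary. For judging how many test words never occur in training, the share relative to the test vocabulary may be the more informative number. A minimal sketch with toy vocabularies (the helper `coverage` and both toy sets are hypothetical):

```python
def coverage(reference_vocab, target_vocab):
    """Percentage of `target_vocab` that also occurs in `reference_vocab`."""
    if not target_vocab:
        return 0.0
    return len(reference_vocab & target_vocab) / len(target_vocab) * 100


# Toy vocabularies standing in for train_vocab and test_vocab.
toy_train = {"where", "would", "you", "find", "a", "dog", "?"}
toy_test = {"where", "would", "a", "cat", "sleep", "?"}

print(f"{coverage(toy_train, toy_test):.2f}% of test vocab seen in train")
print(f"{coverage(toy_test, toy_train):.2f}% of train vocab seen in test")
```

The two directions generally differ because the sets have different sizes, so it helps to state explicitly which vocabulary is the denominator.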

# Preprocessing

As per the project requirements we need correct and justified decisions on: 
- Tokenization 
- Lowercasing, stemming, lemmatizing, stopword/punctuation removal 
- Removal of unknown/other words 
- Format cleaning (e.g. html-extracted text) 
- Truncation 
- Feature selection 
- Input format: how is data passed to the model? 
- Label format: what should the model predict? 
- Batching, padding 
- Vocabulary, embedding

We choose minimal preprocessing:
- We tokenize using NLTK's word tokenizer
- We preserve case, stopwords, and punctuation as they contain valuable information
- We skip stemming/lemmatization since our embedding model (FastText) handles word variations effectively
- No unknown word removal is needed as FastText builds vectors from character substrings
- No truncation required since all questions are under 400 characters
- No format cleaning needed as the dataset is already clean

Concretely, my `preprocess_text` function performs only tokenization and preserves case, punctuation, and all words, including stopwords. I chose this approach because:
  - Case preservation: Words like 'I' or 'US' retain their semantic meaning
  - Stopword retention: Function words provide grammatical context that is important for question answering
  - Punctuation preservation: Punctuation can carry meaningful information
  - No stemming/lemmatization: FastText can handle morphological variations through subword embeddings
  - No unknown word removal: FastText creates vectors from character n-grams, allowing it to handle OOV words and misspellings

In [None]:
def preprocess_text(text):
    """Tokenize `text` with NLTK, preserving case, punctuation, and stopwords."""
    if not isinstance(text, str):
        raise TypeError(f"Input must be a string, got {type(text).__name__} instead")
    
    if not text or text.isspace():
        raise ValueError("Input text cannot be empty or whitespace only")

    # Make sure the punkt tokenizer data is available before tokenizing
    try:
        nltk.data.find('tokenizers/punkt')
    except LookupError:
        try:
            nltk.download('punkt')
        except Exception as e:
            raise RuntimeError(f"Failed to download NLTK punkt tokenizer: {e}") from e

    try:
        tokens = word_tokenize(text)
    except Exception as e:
        raise RuntimeError(f"Tokenization failed: {e}") from e

    # Raised outside the try block so the message is not wrapped a second time
    if len(tokens) == 0:
        raise RuntimeError("Tokenization produced no tokens for non-empty input")

    return tokens
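
To illustrate the case-preservation argument, here is a minimal sketch that uses a simple regular-expression tokenizer as a stand-in for NLTK's `word_tokenize`. This substitution is an assumption made only to keep the example dependency-free; the two tokenizers differ in details such as the handling of contractions. The helper `simple_tokenize` and the sample sentence are hypothetical.

```python
import re


def simple_tokenize(text):
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


sentence = "Why did the US team ask us?"
cased = simple_tokenize(sentence)
lowered = simple_tokenize(sentence.lower())

# With case preserved, the country 'US' and the pronoun 'us' stay distinct.
print(cased)
# After lowercasing, both collapse into the same token.
print(lowered)
```

The question mark also survives as its own token, which is the behaviour the punctuation-preservation point relies on.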


To get a better understanding we check how the loaded embeddings look and how big the vocabulary is.

In [None]:
print(f"Word embeddings vector size: {wv.vector_size}")
print(f"Word embeddings vocab size: {len(wv.index_to_key)}")

print("\nFirst 10 words in vocabulary:")
print(wv.index_to_key[:10])

print("\nLast 5 words in vocabulary:")
print(wv.index_to_key[-5:])

Next, we inspect what the vectors of different words look like

In [None]:
# Print the vector of a word that should be in the vocabulary
print(wv["if"])

# Print the vector of a made-up word that should not be in the vocabulary
print(wv["apdnbajknbäaperoanböajnbäpad"])

# Model

# Training

# Evaluation

# Interpretation