# Task A – Creating Word Embeddings

In this task, I wanted to understand and experiment with how words can be represented as numbers.  
Word embeddings capture the meaning of words in numerical form, so that words with similar meanings end up close to each other in vector space.
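
To make "close to each other" concrete, here is a minimal sketch of cosine similarity, the usual closeness measure for word vectors. The three-dimensional vectors are hand-picked, hypothetical values for illustration only; they are not taken from any trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, chosen by hand so that "court" and "judge" point
# in a similar direction while "mumbai" points elsewhere.
toy_vectors = {
    "court": [0.9, 0.1, 0.0],
    "judge": [0.8, 0.2, 0.1],
    "mumbai": [0.0, 0.2, 0.9],
}

sim_court_judge = cosine_similarity(toy_vectors["court"], toy_vectors["judge"])
sim_court_mumbai = cosine_similarity(toy_vectors["court"], toy_vectors["mumbai"])
print(round(sim_court_judge, 3), round(sim_court_mumbai, 3))
```

A value near 1 means the two vectors point in almost the same direction; a value near 0 means they are nearly unrelated.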

I used the **NewsQA dataset** for this part.
I decided to create embeddings using two methods:
- **Word2Vec (Skip-gram)** – a classical and common approach that learns from context words.
- **BERT** – a modern pretrained Transformer model that understands words in context.

I will compare and save the embeddings for both methods.

In [1]:
import pandas as pd

splits = {'train': 'data/train-00000-of-00001-ec54fbe500fc3b5c.parquet', 'validation': 'data/validation-00000-of-00001-3cf888b12fff1dd6.parquet'}
df = pd.read_parquet("hf://datasets/lucadiliello/newsqa/" + splits["train"])

In [2]:
df.head()

| | context | question | answers | key | labels |
|---|---|---|---|---|---|
| 0 | NEW DELHI, India (CNN) -- A high court in nort... | What was the amount of children murdered? | [19] | da0e6b66e04d439fa1ba23c32de07e50 | [{'end': [295], 'start': [294]}] |
| 1 | NEW DELHI, India (CNN) -- A high court in nort... | When was Pandher sentenced to death? | [February.] | 724f6eb9a2814e4fb2d7d8e4de846073 | [{'end': [269], 'start': [261]}] |
| 2 | NEW DELHI, India (CNN) -- A high court in nort... | The court aquitted Moninder Singh Pandher of w... | [rape and murder] | d64cbb90e5134081acfa83d3e702408c | [{'end': [638], 'start': [624]}] |
| 3 | NEW DELHI, India (CNN) -- A high court in nort... | who was acquitted | [Moninder Singh Pandher] | fd7177ee6f1f4d62becd983a0305f503 | [{'end': [216], 'start': [195]}] |
| 4 | NEW DELHI, India (CNN) -- A high court in nort... | who was sentenced | [Moninder Singh Pandher] | cd25c69f631349748ccdeccaace66463 | [{'end': [216], 'start': [195]}] |


### 1. Understanding the Data

The dataset has these columns:
- `context` → paragraph of text from the article  
- `question` → question asked from that paragraph  
- `answers` → correct answer

For word embeddings, I’ll combine only the `context` and `question` columns, since both contain natural-language text that is useful for training.

In [3]:
# Fill missing values before casting: astype(str) would turn NaN into the literal string "nan"
df['text'] = df['context'].fillna("").astype(str) + " " + df['question'].fillna("").astype(str)  # combine the context and question columns
texts = df['text'].tolist()
print("Number of text entries:", len(texts))

Number of text entries: 74160


### 2. Basic Text Preprocessing

Before training, I cleaned the text by lowercasing everything, removing every character that is not a letter or whitespace (so digits and punctuation are dropped too), and splitting the result into tokens (words).  
I kept the cleaning light because the goal here is just to make the text clean enough for training.

In [4]:
import re
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')

def clean_text(text):
    text = text.lower()                   # lowercase all the characters
    text = re.sub(r'[^a-z\s]', '', text)  # delete everything except letters and whitespace (digits and punctuation too)
    tokens = word_tokenize(text)
    return tokens

sentences = [clean_text(t) for t in texts]
print(sentences[0][:20])

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/macklinchrissmiranda/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


['new', 'delhi', 'india', 'cnn', 'a', 'high', 'court', 'in', 'northern', 'india', 'on', 'friday', 'acquitted', 'a', 'wealthy', 'businessman', 'facing', 'the', 'death', 'sentence']
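
One side effect worth knowing: the regex in `clean_text` deletes non-letter characters outright, so words separated only by an apostrophe or a full stop are fused into one token. The sketch below contrasts that behaviour with a hypothetical variant that substitutes a space instead. The function names and the sample sentence are made up for illustration, and plain `str.split` stands in for `word_tokenize` so the snippet runs without NLTK data.

```python
import re

def clean_delete(text):
    """Mirrors the regex used above: non-letter characters are deleted."""
    return re.sub(r'[^a-z\s]', '', text.lower()).split()

def clean_space(text):
    """Hypothetical variant: non-letter characters are replaced by a space."""
    return re.sub(r'[^a-z\s]', ' ', text.lower()).split()

# Made-up sentence with an apostrophe, a missing space after a full stop, and a number.
sample = "India's court ruled now.Justice was served in 2009."

print(clean_delete(sample))
print(clean_space(sample))
```

Neither version is strictly better: the space-substituting variant avoids fused tokens but leaves stray single-letter tokens such as `s`, and both drop digits entirely.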


### 3. Word2Vec – Skip-gram Model

I used **Word2Vec** from the Gensim library.  
Skip-gram predicts the surrounding words given a target word, which tends to represent rare words better than CBOW.  
I used:
- vector size = 100  
- window = 8 (maximum distance between the target word and a context word)
- min_count = 3 (ignore words that appear fewer than 3 times)
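
To illustrate what the window controls, here is a minimal conceptual sketch of how skip-gram turns a token list into (target, context) training pairs. `skipgram_pairs` is a hypothetical helper written for this explanation, not part of Gensim; real implementations typically also shrink the window randomly and subsample frequent words, which this sketch leaves out.

```python
def skipgram_pairs(tokens, window):
    """Return (target, context) pairs for tokens at most `window` positions apart."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)                # left edge of the window
        hi = min(len(tokens), i + window + 1)  # right edge (exclusive)
        for j in range(lo, hi):
            if j != i:                         # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

# A few tokens from the first cleaned sentence, with a small window for readability.
tokens = ['a', 'high', 'court', 'in', 'northern', 'india']
pairs = skipgram_pairs(tokens, window=2)
print(pairs[:6])
```

With `window=8`, as used in this notebook, each target word is paired with up to 16 neighbours, so the model sees broader, more topical context than it would with a small window.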

In [5]:
from gensim.models import Word2Vec

w2v_model = Word2Vec(
    sentences=sentences,
    vector_size=100,
    window=8,
    min_count=3,
    sg=1                   # 1 for skip-gram
)

print("Vocabulary size:", len(w2v_model.wv))

Exception ignored in: 'gensim.models.word2vec_inner.our_dot_float'

Vocabulary size: 93390


### 4. Saving Word2Vec Embeddings

I then saved all the word embeddings produced by skip-gram to a CSV file in the format `(word, embedding)`.  
Each word has a 100-dimensional vector, since `vector_size` was set to 100 above.

In [6]:
import csv

with open("word2vec_embeddings.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["word", "embedding"])           # header row
    for word in w2v_model.wv.index_to_key:           # vocabulary, most frequent first
        writer.writerow([word, w2v_model.wv[word].tolist()])

print("Word2Vec embeddings saved!")

Word2Vec embeddings saved!
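
Because each embedding is written as the text form of a Python list, it has to be parsed when the file is read back. The sketch below shows one way to do that, assuming the same `(word, embedding)` layout. It uses an in-memory buffer with made-up three-dimensional vectors in place of the real `word2vec_embeddings.csv`.

```python
import csv
import io
import json

# Tiny stand-in for the saved file, written the same way as above
# (the vectors are made-up values, not real embeddings).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["word", "embedding"])
writer.writerow(["court", [0.1, -0.2, 0.3]])
writer.writerow(["judge", [0.0, 0.5, -0.4]])

# Read it back: each embedding cell holds text like "[0.1, -0.2, 0.3]",
# which is also valid JSON, so json.loads turns it into a list of floats.
buffer.seek(0)
loaded = {}
for row in csv.DictReader(buffer):
    loaded[row["word"]] = json.loads(row["embedding"])

print(loaded["court"])
```

For the real file, the same loop should work with `open("word2vec_embeddings.csv", newline="")` in place of the buffer.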


### 5. Checking Similar Words

I tested some example words to see if the model learned meaningful relations.

In [7]:
print(w2v_model.wv.most_similar('india', topn=5))

[('indias', 0.7427746057510376), ('delhi', 0.7172093391418457), ('indian', 0.6961739659309387), ('mumbai', 0.6740450263023376), ('pallavakam', 0.646253228187561)]


In [8]:
print(w2v_model.wv.most_similar('court', topn=5))

[('judge', 0.8309882879257202), ('supreme', 0.8072786927223206), ('kirkwood', 0.7761055827140808), ('appeals', 0.7570107579231262), ('nowjustice', 0.7543099522590637)]


# Use of Pretrained Models
I also wanted to use the BERT model, since its contextual embeddings work well for rare words and for words whose meaning changes with context. However, since I will be using pretrained models in later parts of this task, I skipped it for Part A.