In [1]:
from datasets import load_dataset

ds = load_dataset("StellarMilk/newsqa")



In [2]:
ds

DatasetDict({
    train: Dataset({
        features: ['paragraph', 'questions', 'answers', 'questions_answers'],
        num_rows: 10327
    })
    validation: Dataset({
        features: ['paragraph', 'questions', 'answers', 'questions_answers'],
        num_rows: 574
    })
    test: Dataset({
        features: ['paragraph', 'questions', 'answers', 'questions_answers'],
        num_rows: 574
    })
})

In [3]:
import spacy

# NER and the dependency parser are not needed for lemmatization; disabling them speeds things up.
nlp = spacy.load("en_core_web_sm", disable=["ner", "parser"])

paragraphs = ds["train"]["paragraph"]

def doc_to_tokens(doc):
    """Return the lemmas of alphabetic, non-stopword tokens longer than two characters."""
    return [
        token.lemma_               # use lemmatized form
        for token in doc
        if not token.is_stop       # remove stopwords
        and token.is_alpha         # keep alphabetic tokens only
        and len(token) > 2         # remove short words
    ]

def preprocess_spacy(text):
    """Single-string helper; unlike the batch loop below, it lowercases the text first."""
    return doc_to_tokens(nlp(text.lower()))

# Build one token list per paragraph; nlp.pipe processes the texts in batches.
sentences = []
batch_size = 50  # adjust depending on your RAM
for doc in nlp.pipe(paragraphs, batch_size=batch_size):
    tokens = doc_to_tokens(doc)
    if tokens:
        sentences.append(tokens)

print(f"Processed {len(sentences)} paragraphs into tokenized sentences.")

Processed 10327 paragraphs into tokenized sentences.


**1) Bag of Words (BoW)**

The BoW model represents text as a vector of word occurrence counts, ignoring grammar, semantics, and word order.
For example:

*   sent1: "I went to the market"
*   sent2: "I got fruits"

The vocabulary becomes [I, went, to, the, market, got, fruits], and the count vectors become:

*   sent1: 1 1 1 1 1 0 0
*   sent2: 1 0 0 0 0 1 1
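
The toy example above can be reproduced in a few lines of plain Python. This is only a sketch: the names `toy_sents`, `toy_vocab`, and `toy_vectors` are made up for illustration, and the vocabulary is kept in order of first appearance to match the listing above. scikit-learn's `CountVectorizer`, by default, lowercases the text, sorts the vocabulary alphabetically, and drops one-character tokens such as "I", so its output would look slightly different.

```python
toy_sents = ["I went to the market", "I got fruits"]

# Build the vocabulary in order of first appearance.
toy_vocab = []
for sent in toy_sents:
    for word in sent.split():
        if word not in toy_vocab:
            toy_vocab.append(word)

# Each sentence becomes a vector of per-word counts over the vocabulary.
toy_vectors = [[sent.split().count(w) for w in toy_vocab] for sent in toy_sents]

print(toy_vocab)
for sent, vec in zip(toy_sents, toy_vectors):
    print(sent, "->", vec)
```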




In [4]:
from sklearn.feature_extraction.text import CountVectorizer

In [5]:
import pandas as pd

In [6]:
docs = [' '.join(s) for s in sentences]

In [7]:
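bow = CountVectorizer()
X_bow = bow.fit_transform(docs)          # sparse matrix of shape (n_docs, n_vocab)
vocab_bow = bow.get_feature_names_out()
# Each word's "embedding" is its column of per-document counts.
# Caveats: toarray() densifies the whole matrix, which can take several GB of RAM, and
# NumPy arrays longer than 1000 elements are written to CSV in abbreviated "[0 0 ... 0]" form;
# convert each column with .tolist() if the full vector is needed.
df_bow = pd.DataFrame({
    "word": vocab_bow,
    "embedding": list(X_bow.toarray().T)
})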

In [8]:
df_bow.to_csv("embeddings_bow1.csv", index=False)

**2) Term Frequency-Inverse Document Frequency (TF-IDF)**

TF-IDF also represents text as a vector, but instead of just counting occurrences like BoW, it weights each word by its importance in the corpus.

*   Words that are frequent across all documents (like "the", "I") get lower weight.
*   Rare, meaningful words get higher weight.

TF: frequency of the word in the sentence.
IDF: $\log\left(\frac{N}{1 + N_w}\right)$, where $N$ is the total number of documents and $N_w$ is the number of documents containing that word.

TF-IDF = TF × IDF gives the final weights: counts × rarity → "weighted values per word".

Note that scikit-learn's `TfidfVectorizer` uses a slightly different, smoothed IDF by default, $\ln\left(\frac{1 + N}{1 + N_w}\right) + 1$, and L2-normalizes each document vector, so its numbers will not match the textbook formula exactly.
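
To make the weighting concrete, here is a minimal plain-Python sketch of the textbook formula given above, $\log(N / (1 + N_w))$. The four-document corpus `toy_docs` and the helper names `tf`, `idf`, and `tfidf` are invented for illustration and are not part of the notebook's pipeline.

```python
import math

# Hypothetical toy corpus: each document is a list of tokens.
toy_docs = [
    ["i", "went", "to", "the", "market"],
    ["i", "got", "fruits"],
    ["i", "like", "fruits"],
    ["the", "market", "was", "closed"],
]
N = len(toy_docs)

def tf(word, doc):
    """Raw count of the word in one document."""
    return doc.count(word)

def idf(word):
    """Textbook IDF from the text above: log(N / (1 + N_w))."""
    n_w = sum(1 for doc in toy_docs if word in doc)
    return math.log(N / (1 + n_w))

def tfidf(word, doc):
    return tf(word, doc) * idf(word)

# Score every word of the second document.
for word in toy_docs[1]:
    print(word, round(tfidf(word, toy_docs[1]), 3))
```

In this corpus "i" occurs in three of the four documents, so its weight drops to zero, while "got", which occurs in only one, receives the highest weight.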

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()

In [10]:
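X_tfidf = tfidf.fit_transform(docs)      # sparse matrix of shape (n_docs, n_vocab)
vocab_tfidf = tfidf.get_feature_names_out()
# Each word's "embedding" is its column of per-document TF-IDF weights.
# As with BoW, toarray() is memory-hungry, and arrays longer than 1000 elements
# are abbreviated when written to CSV unless converted with .tolist().
df_tfidf = pd.DataFrame({
    "word": vocab_tfidf,
    "embedding": list(X_tfidf.toarray().T)
})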

In [11]:
df_tfidf.to_csv("embeddings_tfidf1.csv", index=False)

**3) Word2Vec (CBOW)**

The first of the models covered here to capture **semantic meaning**.
It predicts a word given its context (neighboring words).
Given a vocabulary of size $V$ and embedding dimension $d$, CBOW uses an embedding matrix $W \in \mathbb{R}^{V \times d}$. For a target word, the input is the average of the one-hot vectors of its context words, which is multiplied by $W$ to get a hidden representation. This is then passed through a **softmax layer to predict the target word**, and the embedding matrix $W$ is optimized via **cross-entropy loss**. The resulting vectors are dense, low-dimensional, and capture meaningful semantic relationships between words.
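
The forward pass described above can be sketched with NumPy as follows. This is an illustrative toy rather than what gensim runs internally: the five-word vocabulary, the random weights, and the names `W_out` and `cbow_forward` are assumptions, and gensim's `Word2Vec` uses negative sampling by default instead of a full softmax.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_vocab = ["i", "went", "to", "the", "market"]
V, d = len(toy_vocab), 3

W = rng.normal(scale=0.1, size=(V, d))      # embedding matrix, one row per word
W_out = rng.normal(scale=0.1, size=(d, V))  # output projection (assumed name)

def cbow_forward(context_ids, target_id):
    """One CBOW forward pass: averaged context -> hidden -> softmax -> loss."""
    x = np.zeros(V)
    for i in context_ids:
        x[i] += 1.0 / len(context_ids)      # average of the one-hot context vectors
    h = x @ W                               # hidden representation, shape (d,)
    logits = h @ W_out                      # one score per vocabulary word
    probs = np.exp(logits - logits.max())   # numerically stable softmax
    probs /= probs.sum()
    loss = -np.log(probs[target_id])        # cross-entropy for the true target
    return h, probs, loss

# Predict "to" (index 2) from its neighbors "went" (1) and "the" (3).
h, probs, loss = cbow_forward([1, 3], 2)
print(probs.round(3), float(loss))
```

Multiplying the averaged one-hot vector by $W$ is the same as averaging the context words' rows of $W$, which is why the rows of $W$ end up serving as the word embeddings.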

In [12]:
!pip install gensim



In [13]:
from gensim.models import Word2Vec

In [15]:
import gensim

In [16]:
from gensim.models.callbacks import CallbackAny2Vec

In [17]:
# CBOW
w2v_cbow = Word2Vec(
    sentences,
    vector_size=100,
    window=5,
    sg=0,  # CBOW
    min_count=2,
    workers=4,
    epochs=10,
    compute_loss=True,
    callbacks=[CallbackAny2Vec()]  # optional; the base callback does nothing, subclass it to log the loss
)
cbow_df = pd.DataFrame({
    "word": w2v_cbow.wv.index_to_key,
    "embedding": [w2v_cbow.wv[w].tolist() for w in w2v_cbow.wv.index_to_key]
})
cbow_df.to_csv("embeddings_word2vec_cbow1.csv", index=False)

**4) Word2Vec (Skip-Gram)**

Skip-Gram is the opposite of CBOW.
Whereas CBOW predicts the target from its context, Skip-Gram reverses the task: for each word in a sentence, the model tries to predict the words within a fixed window around it.

With a vocabulary of size $V$ and embedding dimension $d$, Skip-Gram uses an embedding matrix $W \in \mathbb{R}^{V \times d}$ to map one-hot encoded target words to dense vectors. The hidden representation is then multiplied by another weight matrix and passed through a softmax to estimate probabilities for the context words, and the model is trained using cross-entropy loss. The resulting embeddings are dense, low-dimensional, and capture semantic and syntactic relationships between words, so that similar words have similar vector representations.
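
To show what "predict words within a fixed window" means in practice, the sketch below enumerates the (target, context) training pairs that Skip-Gram would see for a short token list. The function `skipgram_pairs` and the sample tokens are hypothetical; real implementations such as gensim add refinements on top of this (for example subsampling of frequent words and negative sampling).

```python
def skipgram_pairs(tokens, window=2):
    """Return every (target, context) pair within `window` positions of each target."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:                      # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

toy_tokens = ["went", "market", "got", "fruit"]
toy_pairs = skipgram_pairs(toy_tokens, window=1)
for target, context in toy_pairs:
    print(target, "->", context)
```

Each pair is one training example: the target word's embedding is used to score the context word, which is the reverse of CBOW's many-to-one setup.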

In [18]:
w2v_sg = Word2Vec(sentences, vector_size=100, window=5, sg=1, min_count=2)
sg_df = pd.DataFrame({
    "word": w2v_sg.wv.index_to_key,
    "embedding": [w2v_sg.wv[w].tolist() for w in w2v_sg.wv.index_to_key]
})
sg_df.to_csv("embeddings_word2vec_skipgram1.csv", index=False)

**5) Bidirectional Encoder Representations from Transformers (BERT)**

BERT is an attention-based Transformer model that generates contextualized word embeddings, meaning the vector for a word depends on its surrounding words in the sentence. Unlike BoW, TF-IDF, or Word2Vec (CBOW/Skip-Gram), which produce static embeddings, BERT captures rich semantic and syntactic information by processing text bidirectionally. It uses multi-layer self-attention to model relationships between all words in a sentence and outputs an embedding for each token. Through attention, it learns how strongly each word should attend to every other word, so the resulting embeddings carry a large amount of contextual information.
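
The core of self-attention can be sketched in NumPy as a single head of scaled dot-product attention. This is a simplified illustration, not BERT itself: the random inputs, the projection matrices `Wq`, `Wk`, `Wv`, and the function name `self_attention` are assumptions, and the real model stacks many such heads and layers with extra components.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sequence of token vectors."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])               # token-to-token relevance
    scores = scores - scores.max(axis=-1, keepdims=True)  # for a stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)        # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n_tokens, d_model = 4, 8
X = rng.normal(size=(n_tokens, d_model))                  # stand-in token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))

out, attn = self_attention(X, Wq, Wk, Wv)
print(attn.round(2))  # row i: how much token i attends to each token
```

Row `i` of `attn` says how much each token contributes to the new representation of token `i`, which is how the output vector for a word comes to depend on the rest of the sentence.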



In [19]:
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

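def get_bert_embedding(word):
    """Embed a single word in isolation by mean-pooling BERT's final hidden states."""
    # The word is tokenized on its own, so no sentence context is used here.
    inputs = tokenizer(word, return_tensors='pt', truncation=True, max_length=10)
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**inputs)
    # Average over all tokens (including [CLS] and [SEP]) to get one 768-dimensional vector.
    return outputs.last_hidden_state.mean(dim=1).squeeze().tolist()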

# Get embeddings for a small vocab (since BERT is heavy)
vocab_sample = list(w2v_cbow.wv.index_to_key)[:1000]
bert_data = pd.DataFrame({
    "word": vocab_sample,
    "embedding": [get_bert_embedding(w) for w in vocab_sample]
})
bert_data.to_csv("embeddings_bert1.csv", index=False)
