# Text Vectorization

In this project, I will explore different text vectorization techniques and compare how effective they are in representing textual data.

### List of Contents

1. What is text vectorization?
    * Text Vectorization and Text Embeddings
2. Data collection
    * Dataset selection
3. Data exploration 
    * Data visualization
4. Preprocessing text data
5. TF-IDF Vectorization
    * Understanding TF-IDF
    * Implementing TF-IDF with scikit-learn
    * Code example
    * Analyzing most important words per category
6. Word Embeddings: Word2Vec
    * Introduction to Word2Vec
7. Sentence embeddings – BERT and sentence transformers
    * Difference between word and sentence embeddings
    * BERT
        * Using BERT for Sentence Embeddings
        * Question Answering (QA)
        * Named Entity Recognition (NER)
        * Masked Language Modeling (MLM)
    * Generate sentence embeddings using SBERT
8. Visualizing vectorized text representations

### 1. What is text vectorization?

Text vectorization is the process of *converting text into numerical representations* so that machine learning models and other computational algorithms can process and analyze it. By doing this, operations on sentences become more like math equations, which is something computers can do quickly, and can do well [1].

Many tasks that one would like to perform on textual data, such as text classification, clustering, and search, can be done much more efficiently with numbers than with words. Since most algorithms operate on numerical data, vectorizing text is crucial for performing these tasks efficiently.

#### Text Vectorization and Text Embeddings

Text vectorization is the broader term: it covers any method that converts text into numerical form. In this regard, **embeddings** are a *specific type* of vectorization.

Traditional vectorization methods often rely on *sparse vectors*. For example, if we have a vocabulary consisting of four words: (orange, apple, mango, banana), and we want to represent "apple" using one-hot encoding, a possible representation would be [0,1,0,0]. Since the vector size depends on the vocabulary size, if the vocabulary has 100,000 words, each word or document is represented by a 100,000-dimensional vector. However, most of the values in these vectors are zero, leading to inefficiencies in storage and computation (you still have to store and process all those zeroes). Additionally, as dimensionality increases, similarity calculations become less meaningful, making clustering or comparing texts based on meaning more difficult.

A major limitation of these methods is that they lack **context awareness** (each word is treated independently, ignoring relationships between them). For example, "car" and "automobile" would be considered completely different, even though they have similar meanings. Likewise, traditional vectorization methods struggle with **polysemy**, where the same word has multiple meanings depending on context (e.g., "bank" as a financial institution vs. "bank" as the side of a river). 
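The one-hot example and the "car" vs. "automobile" problem can be sketched in a few lines of plain Python. This is only an illustration: the helper names `one_hot` and `dot` are my own, and the vocabulary is the four-fruit example from above. The point is that any two distinct words get a dot product of zero, so no notion of similarity can be read from the vectors.

```python
# Vocabulary from the example above; index = position in the one-hot vector
vocabulary = ["orange", "apple", "mango", "banana"]

def one_hot(word, vocab):
    """Return a list with a single 1 at the word's vocabulary index."""
    vector = [0] * len(vocab)
    vector[vocab.index(word)] = 1
    return vector

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

apple = one_hot("apple", vocabulary)
mango = one_hot("mango", vocabulary)

print(apple)              # [0, 1, 0, 0]
print(dot(apple, apple))  # 1: a word only matches itself
print(dot(apple, mango))  # 0: distinct words are always orthogonal
```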

Examples of traditional vectorization methods include **One-Hot Encoding**, **Count Vectorization**, and **TF-IDF**. 

Text embeddings, on the other hand, represent words or sentences as dense, continuous vectors. Each dimension typically carries meaningful information, enabling the encoding of relationships between words. Lower dimensionality also helps reduce memory usage and speed up computations.

Examples of embeddings are Word2Vec, GloVe, FastText and BERT embeddings.

### 2. Data collection

#### Dataset selection

For this project, I chose to work with the BBC News dataset, which contains more than 2000 pre-categorized news articles across five topics. This dataset is small enough for rapid experimentation while still being large enough to reflect real-world document processing challenges.

Unlike other text sources that may contain errors and misspellings, this dataset consists of well-formed sentences. This allows the focus to remain on text representation and modeling techniques rather than data cleaning. News articles offer rich, diverse, and formal text and they include domain-specific terminology, which is great for applying word embeddings and vectorization techniques.

In [None]:
import os
import requests

In [None]:
bbc_url = "https://storage.googleapis.com/dataset-uploader/bbc/bbc-text.csv"
bbc_path = "bbc-text.csv"

In [None]:
# Download dataset
if not os.path.exists(bbc_path):
    response = requests.get(bbc_url)
    with open(bbc_path, "wb") as f:
        f.write(response.content)
    print("Dataset downloaded successfully.")
else:
    print("Dataset already exists.")

### 3. Data exploration

In [None]:
# Let's load the dataset as a dataframe for easy manipulation

import pandas as pd

df = pd.read_csv("bbc-text.csv")

In [None]:
print(df.head())  

The dataframe contains the news category in the 'category' column and the news article itself in the 'text' column.

In [None]:
# Check data types and missing values

df.info()  # info() prints its report directly and returns None

#### Data visualization

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(6,3))
df["category"].value_counts().plot(kind="bar", title="Category Distribution", color="skyblue")
plt.xlabel("Category")
plt.ylabel("Count")
plt.show()


As the output of df.info() shows, the dataset does not contain null values, and the bar chart shows that the news classes are fairly balanced.

### 4. Preprocessing text data

For this crucial step, I am going to use the spaCy Python package.

In [None]:
import spacy

In [None]:
# Load the English NLP model

nlp = spacy.load("en_core_web_sm")

When you call nlp on a text, spaCy first tokenizes the text to produce a Doc object. The Doc object is then processed in several steps (the processing pipeline).

The pipeline used by default includes a tagger, a lemmatizer, a parser and an entity recognizer. Each pipeline component returns the processed Doc, which is then passed on to the next component.

The disable keyword argument is used to disable pipeline components that are not needed [2].

In [None]:
def preprocess_texts(texts):
    processed_texts = []
    for doc in nlp.pipe(texts, disable=["ner", "parser"]):  
        tokens = [token.lemma_ for token in doc if not token.is_punct and not token.is_stop]
        processed_texts.append(" ".join(tokens))
    return processed_texts

df["processed_text"] = preprocess_texts(df["text"])


In [None]:
df.head()

### 5. TF-IDF Vectorization

#### Understanding TF-IDF

TF-IDF (Term Frequency - Inverse Document Frequency) is a statistical measure that evaluates how important a word is to a document in the context of a collection of documents.

The TF-IDF score is defined as:

$$
TF-IDF(w) = TF(w) \times IDF(w)
$$

Let's begin by defining what each term means.

* Term frequency (TF): it measures how often a word (w) appears in a document. Calculating this is pretty straightforward:

$$
TF(w) = \frac{\text{Number of times } w \text{ appears in a document}}{\text{Total number of words in the document}}
$$


* Inverse Document Frequency (IDF): if the importance of a word were measured only by its frequency, common but uninformative words would dominate. To correct for this, the weight of words that appear in many documents needs to be reduced. This term does exactly that.

$$
IDF(w) = \log \frac{\text{Total number of documents}}{\text{Number of documents containing } w}
$$


So, for instance, common words (e.g., "the", "is") get low scores because they appear in almost all documents, whereas important words (e.g., "Brexit" in a political article) get higher scores because they appear frequently in only a few documents.
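The two formulas above can be checked by hand with a minimal pure-Python sketch. The three tokenized documents below are a hypothetical toy corpus (not taken from the BBC dataset), and the function names `tf`, `idf` and `tf_idf` are my own. The sketch follows the definitions exactly as written above, with the natural logarithm assumed.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized toy corpus (not taken from the BBC dataset)
documents = [
    ["market", "share", "rise", "say", "analyst"],
    ["match", "win", "say", "coach"],
    ["film", "award", "say", "director", "film"],
]

def tf(word, doc):
    """Term frequency: occurrences of `word` divided by the document length."""
    return Counter(doc)[word] / len(doc)

def idf(word, docs):
    """Inverse document frequency: log(N / number of documents containing `word`).

    Assumes `word` occurs in at least one document."""
    containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / containing)

def tf_idf(word, doc, docs):
    """TF-IDF score of `word` for one document of the corpus."""
    return tf(word, doc) * idf(word, docs)

print(tf_idf("say", documents[2], documents))   # 0.0: "say" appears in every document
print(tf_idf("film", documents[2], documents))  # (2/5) * log(3/1), about 0.44
```

Note that scikit-learn's TfidfVectorizer does not reproduce these numbers exactly: by default it uses raw counts for TF, a smoothed IDF of the form ln((1 + N) / (1 + df)) + 1, and L2-normalizes each document vector. As a consequence, a word that appears in every document does not get a score of exactly zero there.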

#### Implementing TF-IDF with scikit-learn

I am going to use the TfidfVectorizer class from scikit-learn and apply it to the processed_text column. This class does some preprocessing of its own (lowercasing, tokenization), but in most cases custom preprocessing is still necessary before passing the text in.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

When calling fit_transform(), the output is a TF-IDF-weighted document-term matrix. After applying TF-IDF vectorization, each document (in this particular case, each news article) is transformed into a numerical vector, where each dimension represents a unique word from the entire dataset. The value in each dimension is the TF-IDF score of that word for that particular document.

The first thing the model does is build the feature space (or 'vocabulary'), which entails analyzing the entire dataset and extracting all unique words. These unique words form the features (or dimensions) of our numerical vector representation.

Then, each document is converted into a vector of length equal to the vocabulary size. The value at each position in the vector corresponds to the TF-IDF score of the corresponding word in that document. If a word does not appear in a document, its TF-IDF score is zero for that document. 

Let's illustrate this with an example: 

$$
\begin{array}{|c|c|}
\hline
\text{category} & \text{processed\_text} \\
\hline
\text{tech} & \text{'tv future hand viewer home theatre system'}  \\
\text{business} & \text{'worldcom boss leave book worldcom boss'}  \\
\text{sport} & \text{'tiger wary farrell gamble leicester rush'} \\
\hline
\end{array}
$$



After processing, suppose the vocabulary looks like this (only a subset of the unique words is shown, to keep the example small):

$$
[\text{'tv'}, \text{'future'}, \text{'hand'}, \text{'leave'}, 
\text{'home'}, \text{'boss'}, \text{'rush'}, \text{'book'},\text{'wary'}]
$$
Now, each document is represented as a vector of TF-IDF scores for these words (the scores below are illustrative, not computed):

$$
\begin{array}{|c|c|c|c|c|c|c|c|c|c|}
\hline
\text{Document ID} & \text{tv} & \text{future} & \text{hand} & \text{leave} & \text{home} & \text{boss} & \text{rush} & \text{book} & \text{wary} \\
\hline
1 & 0.50 & 0.60 & 0.60 & 0.00 & 0.50 & 0.00 & 0.00 & 0.00 & 0.00 \\
2 & 0.00 & 0.00 & 0.00 & 0.50 & 0.00 & 0.60 & 0.00 & 0.50 & 0.00 \\
3 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.50 & 0.00 & 0.60 \\
\hline
\end{array}
$$

Each row represents a document, and each column corresponds to a word in the vocabulary, with its TF-IDF score.

The resulting matrices often contain a lot of zeros. Let's consider the extreme case where the feature space is very large but each news article is relatively short. In this case, most of the entries in the matrix will be zero, as each document will only contain a small subset of the vocabulary.

If we stored this matrix as a dense matrix, it would be highly inefficient because we would still need to allocate memory for all those zero values.

A more memory-efficient solution is to use a sparse matrix representation, such as the Compressed Sparse Row (CSR) format. This format stores only the nonzero values, along with their column indices and a pointer to where each row starts, which significantly reduces memory usage.
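To make the CSR idea concrete, here is a small pure-Python sketch that converts a dense matrix into the three arrays the format relies on. The function name `to_csr` and the 3 x 5 matrix are hypothetical; the array names `data`, `indices` and `indptr` follow the attribute names SciPy uses for its CSR matrices. It is an illustration of the layout, not how SciPy builds its matrices internally.

```python
def to_csr(dense):
    """Convert a list-of-lists matrix into CSR arrays (data, indices, indptr)."""
    data, indices, indptr = [], [], [0]
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)    # the nonzero value itself
                indices.append(col)   # its column index
        indptr.append(len(data))      # where the next row starts in `data`
    return data, indices, indptr

# Hypothetical 3 x 5 document-term matrix with mostly zero entries
dense = [
    [0.5, 0.0, 0.0, 0.6, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.7, 0.0, 0.0, 0.3],
]

data, indices, indptr = to_csr(dense)
print(data)     # [0.5, 0.6, 0.7, 0.3]
print(indices)  # [0, 3, 1, 4]
print(indptr)   # [0, 2, 2, 4]
```

The entries of row i are found in `data[indptr[i]:indptr[i + 1]]`, so the empty second row costs nothing beyond one repeated pointer. Only 4 of the 15 values are stored.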

#### Code example

In [None]:
# Initialize TF-IDF vectorizer

tfidf_vectorizer = TfidfVectorizer(max_features=5000)  # stick with the 5000 most common words

In [None]:
# Fit the vectorizer and transform the processed text

tfidf_matrix = tfidf_vectorizer.fit_transform(df["processed_text"])

In [None]:
type(tfidf_matrix)

In [None]:
print(tfidf_matrix)

Each line follows the format: (document_index, word_index) followed by the TF-IDF score.

Let's look at the first line:

0 → The document index (i.e., first news article).

4668 → The column index (i.e., word's position in the vocabulary).

0.437 → The TF-IDF score for that word in the document.

Next, convert the sparse matrix to a dense array for easier manipulation:

In [None]:
tfidf_array = tfidf_matrix.toarray()

In [None]:
# Get feature names

tfidf_features = tfidf_vectorizer.get_feature_names_out()

In [None]:
# convert TF-IDF matrix to dataframe for convenience

tfidf_df = pd.DataFrame(tfidf_array, columns=tfidf_features)

print(tfidf_df.head())

#### Analyzing most important words per category

In [None]:
import numpy as np

# Build a dense DataFrame of TF-IDF scores (one row per article, one column per word)
category_means = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
category_means["category"] = df["category"]

# Compute mean TF-IDF score for each category
category_tfidf = category_means.groupby("category").mean()

# Display top words for each category
for category in category_tfidf.index:
    print(f"\nTop words in {category} articles:")
    print(category_tfidf.loc[category].nlargest(10))  # Show top 10 words

<ins>Observation</ins>: a word that appears in every category is "say." This makes sense, even though it is not a particularly meaningful word in terms of topic-specific content. Given that the dataset consists of news articles, it is common for journalists to cite statements from sources. This frequent attribution of speech may explain why "say" is among the most common words across categories.

In theory, "say" should have a low TF-IDF score because it appears in articles of every category. However, if its TF is extremely high and its IDF is not low enough, it might still rank highly. One possible approach to address this is to use a more aggressive IDF weighting or even to normalize the score. An even more radical approach would be to add the word to the stop-word list so that it is removed.

### 6. Word Embeddings: Word2Vec

Now that we've covered TF-IDF, let's move on to word embeddings, specifically Word2Vec. As discussed earlier, text embeddings represent words as **low-dimensional**, **dense vectors**, and can capture their relationships in a *continuous vector space*.

Unlike traditional text vectorization methods, word embeddings have the advantage that similar words (e.g., king and queen) are positioned closely in the vector space. This enables embeddings to capture semantic relationships between words, which is not possible with simple text vectorization techniques like TF-IDF.

#### Introduction to Word2Vec

Word2Vec is a neural network-based approach that learns word relationships by analyzing their context in a large collection of texts.
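To give an idea of what "context" means here, the sketch below generates (target, context) pairs from a sliding window, which is the kind of training example used by the skip-gram variant of Word2Vec. The function name `context_pairs` and the sentence are hypothetical, and the sketch leaves out the refinements of a real implementation (for example negative sampling); it only shows how a window turns a sentence into word pairs.

```python
def context_pairs(tokens, window=2):
    """Yield (target, context) pairs for every token within `window` positions."""
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                yield target, tokens[j]

# Hypothetical tokenized sentence
tokens = ["government", "announce", "new", "tax", "policy"]

pairs = list(context_pairs(tokens, window=1))
print(pairs)
```

With `window=1`, each word is paired only with its immediate neighbours. A larger window (such as the `window=5` used in the training cell further down) pairs each word with more distant neighbours as well.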

Let's begin by importing Gensim, an open-source Python library designed for unsupervised topic modeling and NLP tasks, which includes, among others, the Word2Vec algorithm.

In [None]:
import gensim
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

In [None]:
# Tokenize text
sentences = df["processed_text"].apply(lambda x: simple_preprocess(x)).tolist()

In [None]:
# Train Word2Vec model
w2v_model = Word2Vec(sentences=sentences, vector_size=100, window=5, min_count=2, workers=4)

In [None]:
# Save model
w2v_model.save("word2vec_bbc.model")
print("Word2Vec model trained and saved!")

In [None]:
# Load model
w2v_model = Word2Vec.load("word2vec_bbc.model")

# Find words most similar to "government"
print(w2v_model.wv.most_similar("government", topn=5))
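The ranking returned by most_similar is based on cosine similarity between word vectors. The sketch below reproduces that idea in plain Python. The 3-dimensional vectors are invented for illustration only (the trained model uses 100 dimensions), and the helper names are my own, not part of Gensim.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional embeddings, invented for illustration only
toy_vectors = {
    "government": [0.9, 0.1, 0.0],
    "minister":   [0.8, 0.2, 0.1],
    "football":   [0.0, 0.1, 0.9],
}

def most_similar(word, vectors, topn=2):
    """Rank all other words by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scores = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:topn]

ranking = most_similar("government", toy_vectors)
print(ranking)  # "minister" ranks above "football"
```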


### 7. Sentence embeddings – BERT and sentence transformers

#### Difference between word and sentence embeddings

Unlike word embeddings, where the transformation is applied to each word individually, sentence embeddings encode entire sentences into a single vector representation.

In models like Word2Vec, the relationships between words are often lost. For example, the sentences "John loves Mary" and "Mary loves John" have very different meanings, yet their word embeddings may not capture this distinction effectively. Word order and sentence semantics are not preserved.
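A common way to get a sentence vector out of static word embeddings is to average the word vectors, and this is where word order disappears. The sketch below assumes that averaging strategy; the 2-dimensional vectors are invented for illustration and the function name `average_embedding` is hypothetical.

```python
# Hypothetical 2-dimensional static word vectors, invented for illustration
word_vectors = {
    "john":  [1.0, 0.0],
    "loves": [0.0, 1.0],
    "mary":  [0.5, 0.5],
}

def average_embedding(sentence, vectors):
    """Average the static vectors of all words in the sentence."""
    tokens = sentence.lower().split()
    dims = len(next(iter(vectors.values())))
    return [sum(vectors[t][d] for t in tokens) / len(tokens) for d in range(dims)]

a = average_embedding("John loves Mary", word_vectors)
b = average_embedding("Mary loves John", word_vectors)

print(a == b)  # True: both sentences collapse to the same vector
```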

In contrast, sentence embeddings retain the full meaning of a sentence. They transform entire sentences into dense, low-dimensional real-valued vectors that capture both word relationships and contextual meaning.

#### BERT 

BERT (Bidirectional Encoder Representations from Transformers) is a contextual language model that processes text while considering both the left and right context of each word.

Traditional word embeddings, like those from Word2Vec, generate static word vectors, meaning a word has the same representation regardless of context. In contrast, BERT produces dynamic, context-dependent embeddings: the same word can have different representations depending on the context.

BERT is designed for token-level tasks such as named entity recognition, part-of-speech tagging and question answering. It can also be fine-tuned for classification tasks like sentiment analysis and spam detection.

##### Using BERT for Sentence Embeddings

BERT can generate sentence embeddings using the [CLS] token, but this approach is not optimized for similarity tasks. Comparing two sentences with BERT is computationally expensive because embeddings must be recomputed for every new pair, making it slower than models specifically designed for sentence similarity, such as Sentence-BERT (SBERT).

##### Question Answering (QA)

In this section, I am going to explore the use of Hugging Face's Transformers library to apply BERT for question answering on the BBC dataset. Transformers provides APIs and tools to download and train state-of-the-art pretrained models for NLP, computer vision, and other tasks [3].

The model will take a context (a news article) and a question as inputs and will find the exact answer in the text.

In [None]:
from transformers import pipeline

In [None]:
# Select an article as context

context = df.loc[4, "text"]  
print(context)

In [None]:
# Initialize the question answering pipeline

qa_pipeline = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

**pipeline** is a high-level API that simplifies the use of pre-trained deep learning models for NLP tasks.

The argument **"question-answering"** specifies that we want to use a model for extracting answers from a given context.

We use **"distilbert-base-cased-distilled-squad"**, a DistilBERT model fine-tuned on SQuAD (the Stanford Question Answering Dataset). DistilBERT is a smaller and faster version of BERT that retains almost the same performance.

In [None]:
# Let's think about some appropriate questions

questions = [
    "Who stars in Ocean's Twelve?",
    "How much did Ocean’s Twelve earn in its opening weekend at the US box office?",
    "Which film did Ocean’s Twelve surpass to become number one at the US box office?",
    "Who directed Ocean’s Twelve?",
    "How did US critics react to Ocean’s Twelve?"
]

In [None]:
# Get answers from BERT

for question in questions:
    answer = qa_pipeline(question=question, context=context)
    print(f"Q: {question}")
    print(f"A: {answer['answer']}\n")

BERT got the director right and provided a somewhat reasonable answer for the critics' reaction, but it made significant errors in identifying the cast, box office earnings, and the film Ocean’s Twelve surpassed.

The model predicts answers based on context. Because it processes text in chunks, this sometimes leads to misinterpretation when multiple similar entities (e.g., multiple numbers, multiple names) are together. It misidentified "Steven Soderbergh" as an actor, likely because his name appeared near the cast list. It also incorrectly pulled "$110m" instead of "$40.8m" maybe because "$110m" appears later in the text. 

BERT is decent for simple fact extraction, but it’s not great at reasoning, handling numbers, or distinguishing subtle relationships in a complex text.

##### Named Entity Recognition (NER) with BERT

NER identifies people, organizations, locations, and more in the BBC articles.

In [None]:
# Let's select a different article to analyze

ner_text = df.loc[2, "text"]
print(ner_text)

In [None]:
# Initialize the NER pipeline

ner_pipeline = pipeline("ner", model="dslim/bert-base-NER")



I also tried the model dbmdz/bert-large-cased-finetuned-conll03-english. Note that this particular model is *cased*, meaning it expects properly capitalized words (e.g., "Andy Farrell" instead of "andy farrell"). If the input is all lowercase, it may fail to recognize named entities.

In [None]:
# Run NER

entities = ner_pipeline(ner_text)

In [None]:
print(entities)

In [None]:

for entity in entities:
    print(f"Entity: {entity['word']}, Type: {entity['entity']}, Score: {entity['score']:.2f}")

The first detected 'word' is 'far', and is being recognized as B-PER (beginning of a person’s name) with moderate confidence (score: 0.83). 'far' is likely a truncated part of 'Farrell', which means the model is not properly recognizing full names.

The second detected word is another instance of "far" being misclassified as a person's name.

The third 'word' is 'en', recognized as B-ORG (beginning of an organization’s name), but this time with very low confidence (score: 0.46). The most likely issue here is that 'en' is a fragment of another word (maybe "England" or "Leicester") that got cut.

The model is recognizing fragments ("far" from "Farrell" and "en" from something else) instead of full entities. This happens because transformers tokenize words into *subwords*, and if the model isn't trained well on reassembling them, it gives partial results. Another issue that may affect the model's performance is that the input is *lowercased*. If possible, the input should always be properly cased for NER tasks!

However, not everything is lost. From the output, it is possible to extract the token’s position in the sequence. If a token is suspected to be part of a relevant word, the complete word can be reconstructed by combining adjacent tokens.
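The reconstruction idea can be sketched as follows. BERT's WordPiece tokenizer marks continuation pieces with a "##" prefix, so adjacent pieces can be glued back together using the 'index' field of each entry. The function `merge_subwords` is my own helper, not part of Transformers, and the sample list is hypothetical: it mimics the shape of the pipeline output but is not the real output for this article. The sketch also assumes that the continuation pieces are present in the output, which is not guaranteed when they are not tagged as entities.

```python
def merge_subwords(entities):
    """Merge WordPiece continuation tokens ('##...') into the preceding token."""
    merged = []
    for ent in entities:
        word = ent["word"]
        is_continuation = (
            word.startswith("##")
            and merged
            and ent["index"] == merged[-1]["last_index"] + 1
        )
        if is_continuation:
            merged[-1]["word"] += word[2:]           # drop the '##' marker
            merged[-1]["last_index"] = ent["index"]
            merged[-1]["scores"].append(ent["score"])
        else:
            merged.append({"word": word, "entity": ent["entity"],
                           "last_index": ent["index"], "scores": [ent["score"]]})
    # Report the mean score of the pieces that form each word
    return [{"word": m["word"], "entity": m["entity"],
             "score": sum(m["scores"]) / len(m["scores"])} for m in merged]

# Hypothetical entries shaped like the NER pipeline output (not real results)
sample = [
    {"word": "far", "entity": "B-PER", "score": 0.83, "index": 4},
    {"word": "##rell", "entity": "I-PER", "score": 0.91, "index": 5},
    {"word": "en", "entity": "B-ORG", "score": 0.46, "index": 12},
]

result = merge_subwords(sample)
for item in result:
    print(item["word"], item["entity"], round(item["score"], 2))
```

As an alternative, the Transformers NER pipeline accepts an aggregation_strategy argument (for example "simple") that groups subword tokens into entities, which may make this manual step unnecessary.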

Another possible approach is to split the text into sentences before processing, rather than analyzing the entire text at once.

##### Masked Language Modeling (MLM)

MLM allows us to predict missing words in a sentence and it has practical applications in auto-completion and text suggestion for search engines, virtual assistants, and code editors. MLM is useful for spell checking and grammar correction as well.

In [None]:
# Initialize the MLM pipeline

mlm_pipeline = pipeline("fill-mask", model="bert-base-uncased")

In [None]:
# Select a sentence from a BBC article and mask a word

mlm_text = "the way [MASK] watch tv will be radically different in five years  time"
print(mlm_text)

In [None]:
# Get predictions

predictions = mlm_pipeline(mlm_text)

In [None]:
# Print top predictions
for pred in predictions:
    print(f"Predicted Word: {pred['token_str']}, Confidence: {pred['score']:.2f}")

In [None]:
mlm_text_expanded = "tv future in the hands of viewers with home theatre systems  plasma high-definition tvs  and digital video recorders moving into the living room  the way [MASK] watch tv will be radically different in five years  time.  that is according to an expert panel which gathered at the annual consumer electronics show in las vegas"

In [None]:
# Get predictions for the expanded phrase

predictions_expanded = mlm_pipeline(mlm_text_expanded)

In [None]:
# Print top predictions
for pred in predictions_expanded:
    print(f"Predicted Word: {pred['token_str']}, Confidence: {pred['score']:.2f}")

With the original sentence, the model is uncertain about the subject and predicts "you" (37%), "we" (26%), "I" (24%), etc. This uncertainty is reflected in the *relatively even distribution* of probabilities.

In contrast, when using the expanded sentence, "they" (73%) becomes the dominant prediction with much higher confidence. This suggests that the model has recognized references to "viewers" in the preceding text, making "they" the natural choice in this context. The probabilities for "you" and "we" decrease, highlighting the fact that the model can now discern that the sentence is referring to a third-person group.

This shift in predictions demonstrates how added context helps resolve this ambiguity. This is a good example of how BERT makes use of contextual information to refine its word predictions.

#### Generate sentence embeddings using SBERT

SBERT (Sentence-BERT) is a variant of the BERT model designed specifically for generating **sentence embeddings**.

Unlike traditional BERT, which as we saw earlier is optimized for token-level tasks, SBERT fine-tunes BERT to produce meaningful sentence-level representations.

In [None]:
from sentence_transformers import SentenceTransformer

In [None]:
# Load a pre-trained SBERT model

sbert_model = SentenceTransformer('all-MiniLM-L6-v2') 

all-MiniLM-L6-v2 is a lightweight SBERT model that maps each input text to a 384-dimensional vector.

In [None]:
# Generate embeddings for all articles in a single batched call

df['sbert_embedding'] = list(sbert_model.encode(df['processed_text'].tolist()))

<ins>References</ins>:

[1] https://www.ibm.com/docs/en/watsonx/saas?topic=embeddings-text-overview

[2] https://spacy.io/usage/processing-pipelines

[3] https://huggingface.co/docs/transformers/en/index
