### Vectorisation Techniques: CountVectorizer, BoW, and TF-IDF

In this exercise we will explore the fundamental techniques used to convert raw, unstructured text into a numerical format that machine learning models can process. This process is known as vectorisation or feature extraction. We will implement the Bag of Words (BoW) model using CountVectorizer and extend this concept to Term Frequency-Inverse Document Frequency (TF-IDF) using TfidfVectorizer. We will compare these two methods to illustrate how TF-IDF weighting prioritizes important, domain-specific words over common terms.

#### What we will discuss in this notebook

- **The Bag of Words (BoW) Concept**: Manual illustration of word counting.

- **CountVectorizer Implementation (BoW)**: Creating a document-term matrix.

- **TfidfVectorizer Implementation (TF-IDF)**: Assigning importance weights.

- **Comparative Analysis**: Contrasting BoW counts with TF-IDF scores.

#### What we will learn from this exercise:

- **Vectorization** is essential for all text-based machine learning.

- The **Bag of Words (BoW)** model represents text as the frequency of words, ignoring grammar and word order.

- **CountVectorizer** efficiently implements the BoW model, creating a Document-Term Matrix.

- **TF-IDF** is a weighting scheme that penalizes common words (like "the") and rewards rare, distinctive words, reflecting their importance in classification tasks.

#### Now, let us get started

In [None]:
import pandas as pd
import sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

print(f"scikit-learn Version: {sklearn.__version__}")
print(f"Pandas Version: {pd.__version__}")

# Sample Corpus (Document collection)
CORPUS = [
    "The sun is shining brightly today.",
    "The car is a fast car and the sun is hot.",
    "A sunny day and a hot sun make me happy.",
    "I drive a fast car on a sunny road."
]

print("\n--- Sample Corpus ---")
for i, doc in enumerate(CORPUS):
    print(f"Document {i+1}: {doc}")

scikit-learn Version: 1.7.2
Pandas Version: 2.3.3

--- Sample Corpus ---
Document 1: The sun is shining brightly today.
Document 2: The car is a fast car and the sun is hot.
Document 3: A sunny day and a hot sun make me happy.
Document 4: I drive a fast car on a sunny road.


#### The Bag of Words (BoW) Concept

The Bag of Words (BoW) model is a simple representation of text that describes the occurrence of words within a document. It involves two things:

1. A vocabulary of all known words.

2. The frequency (count) of each word in the document.

The BoW model is called a "bag" because it completely **disregards word order** and grammar.
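To make the "bag" idea concrete, here is a minimal sketch that uses only the standard library. The helper `bag_of_words` and the two example sentences are illustrative assumptions, not part of the corpus used in this notebook. The two sentences contain the same words in a different order, so their bags are identical.

```python
from collections import Counter
import string


def bag_of_words(text):
    """Lowercase the text, remove ASCII punctuation, and count the tokens."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return Counter(cleaned.split())


# Hypothetical sentences: same words, different order
doc_a = "The car is fast and the sun is hot."
doc_b = "Hot is the sun and fast is the car."

bow_a = bag_of_words(doc_a)
bow_b = bag_of_words(doc_b)

# Word order is lost, so both documents map to the same bag
print(bow_a == bow_b)
```

Because the representation keeps only counts, any model built on it cannot distinguish these two sentences.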

We demonstrate the BoW concept by calculating word counts for a single document.

In [None]:
sample_doc = CORPUS[1] # "The car is a fast car and the sun is hot."

# Simple manual tokenization and counting
word_count_dict = {}
for token in sample_doc.lower().split():
    # Strip surrounding punctuation so that "hot." is counted as "hot"
    word = token.strip(".,!?;:")
    if word:
        word_count_dict[word] = word_count_dict.get(word, 0) + 1

print("\n--- Manual BoW for Document 2 ---")
print(f"Document: '{sample_doc}'")
print(f"Word Counts: {word_count_dict}")


--- Manual BoW for Document 2 ---
Document: 'The car is a fast car and the sun is hot.'
Word Counts: {'the': 2, 'car': 2, 'is': 2, 'a': 1, 'fast': 1, 'and': 1, 'sun': 1, 'hot': 1}


The words 'the', 'is', and 'car' appear multiple times. The BoW representation for this document is simply the list of (word, count) pairs.

### CountVectorizer Implementation

CountVectorizer from scikit-learn automates the BoW process by tokenizing the text and building the vocabulary. The output is a sparse Document-Term Matrix where rows are documents and columns are unique words in the vocabulary.

#### Creating the Document-Term Matrix

We apply the vectorizer to our corpus and transform the text into numerical counts.



In [None]:
# Initialize CountVectorizer. We include stop_words removal and lowercase conversion.
count_vectorizer = CountVectorizer(stop_words='english', lowercase=True)

# Fit the vectorizer to the corpus (learn the vocabulary) and transform it into a sparse count matrix
count_matrix_sparse = count_vectorizer.fit_transform(CORPUS)

# Get the feature names (vocabulary)
feature_names = count_vectorizer.get_feature_names_out()

# Convert the sparse matrix to a dense DataFrame for inspection
bow_df = pd.DataFrame(count_matrix_sparse.toarray(), columns=feature_names, index=[f'Doc {i+1}' for i in range(len(CORPUS))])

print("\n--- Document-Term Matrix (BoW Counts) ---")
print(bow_df)

# Inspection: Vocabulary size
print(f"\nVocabulary Size (after stop words removal): {len(feature_names)}")


--- Document-Term Matrix (BoW Counts) ---
       brightly  car  day  drive  fast  happy  hot  make  road  shining  sun  \
Doc 1         1    0    0      0     0      0    0     0     0        1    1   
Doc 2         0    2    0      0     1      0    1     0     0        0    1   
Doc 3         0    0    1      0     0      1    1     1     0        0    1   
Doc 4         0    1    0      1     1      0    0     0     1        0    0   

       sunny  today  
Doc 1      0      1  
Doc 2      0      0  
Doc 3      1      0  
Doc 4      1      0  

Vocabulary Size (after stop words removal): 13


The matrix shows the raw frequency of each word (column) within each document (row). For example, 'sun' appears once in 'Doc 3', while 'car' appears twice in 'Doc 2'.

### TfidfVectorizer Implementation (TF-IDF)

TF-IDF (Term Frequency-Inverse Document Frequency) is a weighting scheme that evaluates how important a word is to a document in a corpus.

- Term Frequency (TF): The frequency of a word in the current document (similar to BoW).

- Inverse Document Frequency (IDF): Measures how rare the word is across all documents. Words that appear in many documents (like 'sun' in this corpus) receive a lower IDF score.

$$\text{TF-IDF}(t,d,D) = \text{TF}(t,d) \times \text{IDF}(t,D)$$
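The formula can be checked by hand. The sketch below assumes scikit-learn's default settings for `TfidfVectorizer` (`smooth_idf=True`, `norm='l2'`), under which the IDF is the smoothed variant `ln((1 + n) / (1 + df(t))) + 1` and each document row is then scaled to unit Euclidean length. Variable names such as `manual` and `reference` are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "The sun is shining brightly today.",
    "The car is a fast car and the sun is hot.",
    "A sunny day and a hot sun make me happy.",
    "I drive a fast car on a sunny road.",
]

# TF: raw counts per document
counts = CountVectorizer(stop_words="english").fit_transform(corpus).toarray()

# IDF: smoothed inverse document frequency (scikit-learn's default variant)
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# TF x IDF, then L2-normalise each document row
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare against the library result
reference = TfidfVectorizer(stop_words="english").fit_transform(corpus).toarray()
print(np.allclose(manual, reference))
```

Note that other textbooks and libraries define IDF slightly differently (for example without smoothing), so absolute scores are only comparable within one implementation.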

#### Creating the TF-IDF Matrix

We use TfidfVectorizer, which calculates both the TF and IDF components and outputs the final weighted scores.

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Define CORPUS inline (safe even after kernel restart)
CORPUS = [
    "The sun is shining brightly today.",
    "The car is a fast car and the sun is hot.",
    "A sunny day and a hot sun make me happy.",
    "I drive a fast car on a sunny road.",
]

# Initialize TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english', lowercase=True)

# Fit and transform
tfidf_matrix_sparse = tfidf_vectorizer.fit_transform(CORPUS)

# Get feature names from this vectorizer
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()

# Create DataFrame
tfidf_df = pd.DataFrame(
    tfidf_matrix_sparse.toarray(),
    columns=tfidf_feature_names,
    index=[f'Doc {i+1}' for i in range(len(CORPUS))]
)

print("\n--- Document-Term Matrix (TF-IDF Weights) ---")
print(tfidf_df.round(3))

#### Why is TF-IDF more useful than BoW? (Comparative Analysis)

The power of TF-IDF lies in how it suppresses frequent, uninformative words and boosts unique, distinguishing words.

Let's compare the raw count and the TF-IDF score in Doc 4 for a word that is common across the corpus ('sun', which occurs in three of the four documents) and a rare word ('drive', which occurs only in Doc 4).

In [None]:
target_doc = 'Doc 4'
common_word = 'sun'
rare_word = 'drive'

print(f"\n--- Comparison in {target_doc} ---")

# Data for the common word ('sun'); it does not occur in Doc 4, so both values are 0
bow_sun = bow_df.loc[target_doc, common_word] if common_word in bow_df.columns else 0
tfidf_sun = tfidf_df.loc[target_doc, common_word] if common_word in tfidf_df.columns else 0

# Data for the rare word ('drive'), which occurs only in Doc 4 and therefore has a high IDF
bow_drive = bow_df.loc[target_doc, rare_word] if rare_word in bow_df.columns else 0
tfidf_drive = tfidf_df.loc[target_doc, rare_word] if rare_word in tfidf_df.columns else 0

print(f"{'Word':10} | {'BoW Count':10} | {'TF-IDF Score':15}")
print("-" * 38)
print(f"{common_word:10} | {bow_sun:<10} | {tfidf_sun:<15.4f}")
print(f"{rare_word:10} | {bow_drive:<10} | {tfidf_drive:<15.4f}")


--- Comparison in Doc 4 ---
Word       | BoW Count  | TF-IDF Score   
--------------------------------------
sun        | 0          | 0.0000         
drive      | 1          | 0.5087         


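'sun' does not occur in Doc 4, so both its raw count and its TF-IDF score are 0 there; that zero comes from absence, not from IDF down-weighting. 'drive' has a raw count of 1, and because it occurs only in Doc 4 it receives the highest IDF any word in this corpus can have, giving a TF-IDF score of 0.5087. The suppression effect is easier to see when two words have the same count in one document: in Doc 4, 'car' and 'drive' each occur once, yet 'car' also appears in Doc 2, so its IDF, and hence its TF-IDF score, is lower than that of 'drive'.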

#### Matrix Sparsity

Both BoW and TF-IDF matrices are typically sparse, meaning most entries are zero. Storing only the non-zero entries saves memory, which is why scikit-learn's vectorizers return SciPy sparse matrices rather than dense NumPy arrays.

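In [None]:
# Calculate sparsity (proportion of zero-valued entries).
# Note: .size on a SciPy sparse matrix counts only the stored entries,
# so the total number of cells is taken from the shape instead.
n_cells_count = count_matrix_sparse.shape[0] * count_matrix_sparse.shape[1]
n_cells_tfidf = tfidf_matrix_sparse.shape[0] * tfidf_matrix_sparse.shape[1]
sparsity_count = 1.0 - (count_matrix_sparse.count_nonzero() / n_cells_count)
sparsity_tfidf = 1.0 - (tfidf_matrix_sparse.count_nonzero() / n_cells_tfidf)

print(f"\nSparsity of Count Matrix: {sparsity_count:.5f}")
print(f"Sparsity of TF-IDF Matrix: {sparsity_tfidf:.5f}")


Sparsity of Count Matrix: 0.63462
Sparsity of TF-IDF Matrix: 0.63462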


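The matrices are sparse because each document uses only a few of the words in the vocabulary, so most entries in its row are zero. With a realistic corpus and a vocabulary of tens of thousands of words, the proportion of zeros is far higher than in this small example.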

### Summary

| Feature     | BoW / CountVectorizer                          | TF-IDF / TfidfVectorizer                          |
|-------------|------------------------------------------------|--------------------------------------------------|
| Value       | Raw count of word occurrences.                 | Weighted score reflecting importance.            |
| Strength    | Simplicity, ease of implementation.            | Captures word importance, better for classification. |
| Weakness    | Frequent words (noise) dominate scores.        | Computationally more complex.                    |
| Application | Simple document clustering, baseline models.   | Text classification, information retrieval ranking. |


**Final Note**: Vectorisation successfully converts text into numerical vectors, with each word acting as a feature. This numerical representation is the required input for all subsequent unsupervised and supervised machine learning tasks in NLP.