## Bag-of-Words (BoW) vs. Term Frequency-Inverse Document Frequency (TF-IDF): Comparing Text Feature Extraction Techniques

In this project we explore two fundamental text feature extraction techniques to understand their impact on the performance of a text classification task. Utilizing a dataset of movie reviews, we aim to classify the sentiment of each review as either positive or negative.

* **Feature Extraction Implementation**: apply BoW and TF-IDF to convert textual data into numerical features suitable for ML algorithms.
* **Model Training and Evaluation**: train a simple logistic regression classifier on the features extracted by each method and evaluate its performance in terms of accuracy, precision, recall, and F1-score.
* **Performance Comparison**: compare the effectiveness of BoW and TF-IDF in capturing relevant information for sentiment analysis.

In [7]:
import numpy as np

# To access ML methods
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# To access datasets and text processing methods
import nltk
nltk.download('movie_reviews')
from nltk.corpus import movie_reviews

# Stopwords are common words (e.g., "the", "is", "in") that do not contribute much to the meaning of a document
nltk.download('stopwords') # ensures that the stopwords dataset is available in our local NLTK data directory
from nltk.corpus import stopwords # imports the module so we can access and use the stopwords in our code

import string

[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [2]:
# Load movie_reviews dataset
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
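# Note: this shuffle is unseeded, so the document order (and the exact metrics reported below) can vary between runs
np.random.shuffle(documents)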

In [3]:
# Prepare dataset
texts = [" ".join(doc) for doc, _ in documents]
labels = [label for _, label in documents]

In [4]:
# Split dataset
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.25, random_state=42)


In [5]:
# Define a function for model training and evaluation
def train_evaluate(features_train, features_test, y_train, y_test):
    model = LogisticRegression(max_iter=1000)
    model.fit(features_train, y_train)
    predictions = model.predict(features_test)
    accuracy = accuracy_score(y_test, predictions)
    precision, recall, f1, _ = precision_recall_fscore_support(y_test, predictions, average='binary', pos_label='pos')
    print(f"Accuracy: {accuracy:.4f}, Precision: {precision:.4f}, Recall: {recall:.4f}, F1-score: {f1:.4f}")
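With `average='binary'` and `pos_label='pos'`, the reported precision, recall, and F1-score describe the positive class only. The following is a minimal sketch of how those three numbers follow from raw counts; the labels are made up for illustration and `binary_metrics` is a hypothetical helper, not a scikit-learn function.

```python
def binary_metrics(y_true, y_pred, pos_label="pos"):
    """Precision, recall and F1 for a single class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == pos_label and p == pos_label for t, p in pairs)  # true positives
    fp = sum(t != pos_label and p == pos_label for t, p in pairs)  # false positives
    fn = sum(t == pos_label and p != pos_label for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Illustrative labels only (not taken from the movie_reviews corpus)
toy_true = ["pos", "pos", "pos", "neg", "neg", "neg"]
toy_pred = ["pos", "pos", "neg", "pos", "pos", "neg"]

precision, recall, f1 = binary_metrics(toy_true, toy_pred)
print(f"Precision: {precision:.4f}, Recall: {recall:.4f}, F1-score: {f1:.4f}")
```

Here two of the four predicted positives are correct (precision 0.5) and two of the three actual positives are found (recall about 0.67), which is why the two metrics can move in opposite directions.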

### Let's compare two vectorization approaches: BoW vs. TF-IDF

In [8]:
# Bag-of-Words (BoW) vectorization (CountVectorizer) represents a document
# as a vector where each dimension corresponds to a word from the vocabulary,
# and the value in each dimension is the count of the word's occurrence in the document.

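# Note: with the default token_pattern, punctuation is never emitted as a token,
# so the string.punctuation entries in the stop list have no effect here.
bow_vectorizer = CountVectorizer(stop_words=stopwords.words('english') + list(string.punctuation))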
X_train_bow = bow_vectorizer.fit_transform(X_train)
X_test_bow = bow_vectorizer.transform(X_test)

print("Bag-of-Words Results:")
train_evaluate(X_train_bow, X_test_bow, y_train, y_test)

Bag-of-Words Results:
Accuracy: 0.8640, Precision: 0.8970, Recall: 0.8261, F1-score: 0.8601


In [9]:
# TF-IDF (Term Frequency-Inverse Document Frequency) vectorization (TfidfVectorizer)
# also represents documents as vectors, but instead of raw counts, it uses TF-IDF scores
# that reflect how important a word is to a document in a collection of documents.
# This helps to adjust for the fact that some words appear more frequently in general
# and may not be as meaningful in distinguishing between documents.

# Same stop list as the BoW cell, so the two runs differ only in how terms are weighted
tfidf_vectorizer = TfidfVectorizer(stop_words=stopwords.words('english') + list(string.punctuation))
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

print("\nTF-IDF Results:")
train_evaluate(X_train_tfidf, X_test_tfidf, y_train, y_test)


TF-IDF Results:
Accuracy: 0.8400, Precision: 0.8502, Recall: 0.8300, F1-score: 0.8400


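In this run, the BoW model scored higher on accuracy (0.8640 vs. 0.8400), precision (0.8970 vs. 0.8502), and F1-score (0.8601 vs. 0.8400), while TF-IDF had marginally higher recall (0.8300 vs. 0.8261). The test set is 25% of the 2,000-review corpus, i.e. 500 reviews, so the accuracy gap corresponds to about 12 reviews and may not hold on a different split. The difference could be due to various factors, including the nature of the dataset, the distribution of words, and how well each representation captures the relevant information for the classification task.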
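Since the comparison above rests on a single train/test split, one way to check whether the gap is stable is to repeat it with cross-validation. The following is a sketch only: `compare_vectorizers` is a hypothetical helper, it uses default vectorizer settings rather than the stop list from the cells above, and the tiny corpus is made up so that the block runs on its own.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline


def compare_vectorizers(texts, labels, cv=5):
    """Return (mean, std) of cross-validated accuracy for BoW and TF-IDF."""
    results = {}
    for name, vectorizer in [("BoW", CountVectorizer()), ("TF-IDF", TfidfVectorizer())]:
        # The pipeline refits the vectorizer inside each fold,
        # so the held-out fold never influences the vocabulary.
        pipeline = make_pipeline(vectorizer, LogisticRegression(max_iter=1000))
        scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="accuracy")
        results[name] = (scores.mean(), scores.std())
    return results


# Made-up corpus for illustration; in the notebook, pass `texts` and `labels` instead
toy_texts = [
    "a wonderful and moving film", "great acting and a great story",
    "truly enjoyable from start to finish", "a brilliant, funny script",
    "loved the cast and the ending", "a dull and lifeless film",
    "terrible acting and a weak story", "boring from start to finish",
    "a clumsy, unfunny script", "hated the cast and the ending",
]
toy_labels = ["pos"] * 5 + ["neg"] * 5

toy_results = compare_vectorizers(toy_texts, toy_labels, cv=5)
for name, (mean, std) in toy_results.items():
    print(f"{name}: accuracy {mean:.4f} +/- {std:.4f}")
```

If the fold-to-fold standard deviation is comparable to the gap between the two means, the ranking of BoW and TF-IDF on this dataset should be treated as inconclusive.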