# Implementing TF-IDF in Python with NLTK and scikit-learn

To implement **TF-IDF in Python**, you typically follow a few core steps. First, you need to **preprocess your text documents**, which includes essential techniques like tokenization, stopword removal, and stemming. After preprocessing, you can calculate the TF-IDF scores using `TfidfVectorizer` from `sklearn.feature_extraction.text`. This class efficiently transforms your documents into **TF-IDF feature vectors**, which are then ready for subsequent text analysis tasks such as classification or clustering.

---

In [1]:
import nltk # Imports the Natural Language Toolkit library.
from sklearn.feature_extraction.text import TfidfVectorizer # Imports TfidfVectorizer for converting text to TF-IDF features.

# Sample documents for demonstration.
sample_documents = [
    "I love to play soccer",
    "Soccer is my favorite sport",
    "I enjoy playing soccer with my friends",
    "Football is another popular sport",
    "I don't like basketball"
]

# Create the TF-IDF vectorizer object.
# This object will learn the vocabulary and IDF values from the documents,
# and then transform the documents into TF-IDF numerical representations.
tfidf_vectorizer = TfidfVectorizer()

# Compute the TF-IDF scores for the sample documents.
# 'fit_transform' first learns the vocabulary and IDF values from 'sample_documents',
# then transforms these documents into a sparse matrix of TF-IDF scores.
tfidf_scores_matrix = tfidf_vectorizer.fit_transform(sample_documents)

# Get the names of the features (terms) from the vectorizer's learned vocabulary.
# These correspond to the columns in the TF-IDF matrix.
feature_names = tfidf_vectorizer.get_feature_names_out()

# Print the TF-IDF scores for each document.
# 'enumerate' is used to get both the index (i) and the document content (doc).
for i, doc in enumerate(sample_documents):
    print("Document:", doc) # Print the original document text.
    # 'enumerate' is used again to get the index (j) and the term (feature_name).
    for j, term in enumerate(feature_names):
        # Access the TF-IDF score for the current document (i) and current term (j).
        score = tfidf_scores_matrix[i, j]
        # Only print terms that have a non-zero TF-IDF score in the current document.
        if score > 0:
            print(f"  {term}: {score:.4f}") # Format score to 4 decimal places for readability.
    print() # Print an empty line for better separation between document outputs.

Document: I love to play soccer
  love: 0.5385
  play: 0.5385
  soccer: 0.3606
  to: 0.5385

Document: Soccer is my favorite sport
  favorite: 0.5422
  is: 0.4375
  my: 0.4375
  soccer: 0.3631
  sport: 0.4375

Document: I enjoy playing soccer with my friends
  enjoy: 0.4428
  friends: 0.4428
  my: 0.3573
  playing: 0.4428
  soccer: 0.2966
  with: 0.4428

Document: Football is another popular sport
  another: 0.4821
  football: 0.4821
  is: 0.3890
  popular: 0.4821
  sport: 0.3890

Document: I don't like basketball
  basketball: 0.5774
  don: 0.5774
  like: 0.5774



---
**TF-IDF scores** provide valuable insights into a term's importance within a document corpus. Understanding how to interpret these scores is key for various text mining and information retrieval tasks.

**High TF-IDF scores** indicate a term is **frequent in a specific document** but **relatively rare across the entire corpus**. This suggests the term is highly **distinctive and significant** to that particular document's content.

Conversely, **low TF-IDF scores** mean a term is **infrequent in a document** or **very common throughout the corpus**. These terms are typically less informative and contribute little to the unique understanding of a specific document (e.g., common words like "the," "and," or "is").

Interpreting TF-IDF also involves **comparing scores** across different terms and documents. By examining scores within a single document, we can identify its most differentiating terms. Comparing scores across different documents helps pinpoint terms that are highly relevant or characteristic of specific documents or topics. This analysis is crucial for tasks like document clustering, topic modeling, and information retrieval, aiding in the identification and extraction of key textual information.
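To make the interpretation concrete, the scores for the first sample document can be reproduced by hand. The sketch below assumes scikit-learn's default settings (smoothed IDF, $\ln\frac{1 + n}{1 + \text{df}} + 1$, followed by L2 normalization) and a tokenizer that mimics the vectorizer's default pattern; the helper names `tokenize` and `tfidf` are illustrative, not part of any library.

```python
import math
import re
from collections import Counter

corpus = [
    "I love to play soccer",
    "Soccer is my favorite sport",
    "I enjoy playing soccer with my friends",
    "Football is another popular sport",
    "I don't like basketball",
]

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters,
    # mirroring TfidfVectorizer's default token pattern.
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in corpus]
n_docs = len(tokenized)

# Document frequency: the number of documents in which each term occurs.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    counts = Counter(tokens)
    # Raw term count multiplied by the smoothed IDF.
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    # L2-normalize so the document vector has unit length.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

scores = tfidf(tokenized[0])
for term in sorted(scores):
    print(f"{term}: {scores[term]:.4f}")
```

Every term occurs once in this document, so the differences come from IDF alone: "soccer" appears in three of the five documents and therefore scores lower than "love", which appears in only one.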

---

### TF-IDF: Information Retrieval's Core

**TF-IDF** stands as a fundamental technique in **information retrieval (IR)**, pivotal for **ranking and retrieving relevant documents** in response to user queries. Search engines widely employ TF-IDF scores to effectively match query terms with document content, thereby delivering more precise search outcomes.

Its primary applications within IR include:

* **Document Ranking:** TF-IDF is instrumental in assessing a document's relevance to a given query. Documents with higher TF-IDF scores for the query's terms are prioritized and ranked higher in search results, ensuring users access the most pertinent information.
* **Keyword Extraction:** The technique is highly effective at identifying **key terms or phrases** within documents by pinpointing those with elevated TF-IDF scores. These distinctive words are crucial assets for tasks like document indexing, categorization, and topic labeling.

In both cases, TF-IDF supplies a simple and inexpensive term-weighting signal on which ranking and keyword selection can be built.
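Document ranking can be sketched with the same vectorizer used elsewhere in this notebook. The example below is a minimal illustration, not how a production search engine works: it assumes cosine similarity between the query vector and each document vector as the relevance score, and the query string is made up for the demonstration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "I love to play soccer",
    "Soccer is my favorite sport",
    "I enjoy playing soccer with my friends",
    "Football is another popular sport",
    "I don't like basketball",
]
query = "soccer friends"  # hypothetical user query

# Learn vocabulary and IDF weights from the corpus only.
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(corpus)

# Project the query into the same TF-IDF space.
query_vector = vectorizer.transform([query])

# Cosine similarity between the query and every document.
similarities = cosine_similarity(query_vector, doc_vectors).ravel()

# Sort document indices from most to least similar.
ranked = sorted(enumerate(similarities), key=lambda pair: pair[1], reverse=True)
for doc_index, similarity in ranked:
    print(f"{similarity:.3f}  {corpus[doc_index]}")
```

The document that contains both query terms ranks first, and documents sharing no term with the query receive a similarity of zero.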

---

In [None]:
import nltk # Imports the Natural Language Toolkit library.
from sklearn.feature_extraction.text import TfidfVectorizer # Imports TfidfVectorizer to convert text to TF-IDF features.

# Sample documents for demonstration.
documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third.",
    "This is the first document?"
]

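# Preprocess the documents.
# Each document is tokenized (split into words) after being converted to lowercase.
# Note: word_tokenize needs NLTK's tokenizer models. If they are missing, run
# nltk.download('punkt') once (recent NLTK releases ask for 'punkt_tab' instead).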
processed_corpus = [nltk.word_tokenize(document.lower()) for document in documents]
# Convert the preprocessed documents (list of tokens) back into strings, joined by spaces.
processed_corpus = [' '.join(doc_tokens) for doc_tokens in processed_corpus]

# Create the TF-IDF vectorizer.
tfidf_vectorizer = TfidfVectorizer()

# Compute the TF-IDF scores.
# 'fit_transform' learns the vocabulary and IDF values from the 'processed_corpus',
# then transforms these documents into a sparse matrix of TF-IDF scores.
tfidf_scores_matrix = tfidf_vectorizer.fit_transform(processed_corpus)

# Get the names of the features (words) from the vectorizer's learned vocabulary.
# These correspond to the columns in the TF-IDF matrix.
feature_names = tfidf_vectorizer.get_feature_names_out()

# Print the TF-IDF scores for each document.
# 'enumerate' is used to get both the document index and its corresponding scores.
# '.toarray()' converts the sparse matrix row to a dense NumPy array for easier iteration.
for doc_index, doc_scores in enumerate(tfidf_scores_matrix.toarray()):
    print(f"Document {doc_index + 1}:") # Print the document number.
    # 'enumerate' is used again to get the word index and its score within the current document.
    for word_index, score in enumerate(doc_scores):
        # Only print terms that have a non-zero TF-IDF score in the current document.
        if score > 0:
            # Print the word (feature_name) and its TF-IDF score, formatted to 4 decimal places.
            print(f"\tWord: {feature_names[word_index]}, TF-IDF Score: {score:.4f}")
    print() # Print an empty line for better separation between document outputs.

Document 1:
	Word: document, TF-IDF Score: 0.4698
	Word: first, TF-IDF Score: 0.5803
	Word: is, TF-IDF Score: 0.3841
	Word: the, TF-IDF Score: 0.3841
	Word: this, TF-IDF Score: 0.3841

Document 2:
	Word: document, TF-IDF Score: 0.6876
	Word: is, TF-IDF Score: 0.2811
	Word: second, TF-IDF Score: 0.5386
	Word: the, TF-IDF Score: 0.2811
	Word: this, TF-IDF Score: 0.2811

Document 3:
	Word: and, TF-IDF Score: 0.5958
	Word: is, TF-IDF Score: 0.3109
	Word: the, TF-IDF Score: 0.3109
	Word: third, TF-IDF Score: 0.5958
	Word: this, TF-IDF Score: 0.3109

Document 4:
	Word: document, TF-IDF Score: 0.4698
	Word: first, TF-IDF Score: 0.5803
	Word: is, TF-IDF Score: 0.3841
	Word: the, TF-IDF Score: 0.3841
	Word: this, TF-IDF Score: 0.3841



---

This example demonstrates how to calculate **TF-IDF scores** for each document in a small corpus using Python. We start by importing the necessary libraries: **NLTK** and **`TfidfVectorizer` from scikit-learn**.

We then define a list of sample documents and **preprocess** them using NLTK's `word_tokenize` to split them into words after converting to lowercase. These preprocessed tokens are then joined back into strings for the `TfidfVectorizer`.

Next, we create a `TfidfVectorizer` instance and use its `fit_transform` method on our processed documents to **compute the TF-IDF scores**. We also retrieve the **feature names (words)** learned by the vectorizer.

Finally, the code iterates through each document's TF-IDF scores and prints only the **non-zero** ones, i.e. the terms that actually occur in that document. Documents 1 and 4 receive identical scores because they differ only in punctuation, which the vectorizer discards.

---

## TF-IDF for keyword extraction

In [1]:
import nltk # Imports the Natural Language Toolkit library.
from sklearn.feature_extraction.text import TfidfVectorizer # Imports TfidfVectorizer for converting text to TF-IDF features.

# Sample document for keyword extraction.
sample_document = "This is a sample document that contains some keywords like apple, apple, banana, banana, and orange, orange."

# Preprocess the document.
# Convert the document to lowercase and then tokenize it (split into words).
tokens = nltk.word_tokenize(sample_document.lower())

# Create the TF-IDF vectorizer.
# Note: TfidfVectorizer expects a list of strings as input, so we pass the tokenized
# document rejoined into a single string within a list.
tfidf_vectorizer = TfidfVectorizer()

# Calculate the TF-IDF scores.
# 'fit_transform' learns the vocabulary from the document and computes the TF-IDF scores.
# Since we have only one document, the corpus for fit_transform is a list containing that single document.
tfidf_scores_matrix = tfidf_vectorizer.fit_transform([' '.join(tokens)])

# Get the names of the features (words) from the vectorizer's learned vocabulary.
feature_names = tfidf_vectorizer.get_feature_names_out()

# Create a dictionary to store keywords and their TF-IDF scores.
keywords_with_scores = {}

# Get TF-IDF scores for each word in the document.
# tfidf_scores_matrix is a sparse matrix. tfidf_scores_matrix.indices gives the column indices
# of non-zero elements, and tfidf_scores_matrix.data gives their corresponding values.
# Since we have only one document, we access the first (and only) row's data.
for word_index, score in zip(tfidf_scores_matrix.indices, tfidf_scores_matrix.data):
    word = feature_names[word_index]
    # Each column index appears at most once in this single-row sparse matrix,
    # so every word is stored exactly once with its TF-IDF score.
    keywords_with_scores[word] = score

# Sort the keywords based on their TF-IDF scores in descending order.
# 'items()' returns key-value pairs, 'key=lambda x: x[1]' sorts by the score (value),
# and 'reverse=True' ensures descending order (highest score first).
sorted_keywords = sorted(keywords_with_scores.items(), key=lambda x: x[1], reverse=True)

# Print the top 5 keywords and their TF-IDF scores.
print("Top 5 Keywords and their TF-IDF Scores:")
for keyword, score in sorted_keywords[:5]:
    print(f"Keyword: {keyword}, TF-IDF Score: {score:.4f}")

Top 5 Keywords and their TF-IDF Scores:
Keyword: apple, TF-IDF Score: 0.4264
Keyword: banana, TF-IDF Score: 0.4264
Keyword: orange, TF-IDF Score: 0.4264
Keyword: this, TF-IDF Score: 0.2132
Keyword: is, TF-IDF Score: 0.2132


---

This example illustrates how to use **NLTK** and **scikit-learn** to calculate TF-IDF scores for **keyword extraction**.

We begin by importing the necessary libraries and defining a **sample document**. The document is then **preprocessed** using NLTK's `word_tokenize` to convert it into tokens. An instance of `TfidfVectorizer` is created to handle the TF-IDF calculations.

The `fit_transform` method of the `TfidfVectorizer` is called on our processed document to **compute the TF-IDF scores**, and the **feature names (words)** are retrieved. These scores, along with their corresponding words, are then stored in a dictionary.

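Finally, the keywords are **sorted by their TF-IDF scores** in descending order, and the top keywords are printed. Because the vectorizer is fitted on a single document, every term receives the same IDF, so the ranking reflects term frequency only: `apple`, `banana`, and `orange` (two occurrences each) lead, and the remaining places go to words that occur once, such as `this` and `is`.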

---

## TF-IDF for text classification

In [2]:
import nltk # Imports the Natural Language Toolkit library.
from sklearn.feature_extraction.text import TfidfVectorizer # Imports TfidfVectorizer to convert text into TF-IDF features.
from sklearn.model_selection import train_test_split # Imports train_test_split for splitting data into training and testing sets.
from sklearn.svm import LinearSVC # Imports Linear Support Vector Classification, a common classifier for text data.

# Sample dataset
# Each tuple contains a document text and its corresponding label (e.g., "positive", "negative").
documents = [
    ("This is a positive review", "positive"),
    ("I do not like this product", "negative"),
    ("This movie is fantastic", "positive"),
    ("The service was terrible", "negative"),
]

# Preprocess the documents
# Separate the text content (corpus) from their labels.
corpus = [doc[0] for doc in documents] # Extracts all document texts.
labels = [doc[1] for doc in documents] # Extracts all corresponding labels.

# Create the TF-IDF vectorizer
# This object will convert the text documents into numerical TF-IDF feature vectors.
tfidf_vectorizer = TfidfVectorizer()

# Split the dataset into training and testing sets.
# test_size=0.2 means 20% of the data will be used for testing, 80% for training.
# random_state=42 ensures reproducibility of the split.
X_train, X_test, y_train, y_test = train_test_split(corpus, labels, test_size=0.2, random_state=42)

# Calculate TF-IDF features for the training set.
# 'fit_transform' learns the vocabulary and IDF weights from the training data,
# then transforms the training texts into TF-IDF vectors.
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)

# Transform the test set using the *learned* vocabulary and IDF weights from the training set.
# 'transform' is used here (not 'fit_transform') to avoid data leakage from the test set.
X_test_tfidf = tfidf_vectorizer.transform(X_test)

# Train a classifier (e.g., LinearSVC).
# LinearSVC is a type of Support Vector Machine suitable for text classification.
classifier = LinearSVC()
classifier.fit(X_train_tfidf, y_train) # Train the classifier using the TF-IDF features and labels from the training set.

# Predict categories for the test set.
# The trained classifier makes predictions on the unseen test data.
predictions = classifier.predict(X_test_tfidf)

# Print the predictions.
# Iterate through the original test texts and their predicted labels.
for i, prediction in enumerate(predictions):
    print(f"Text: {X_test[i]}, Predicted label: {prediction}")

# Note: With only four documents, the training set has three examples and the test set one,
# so the LinearSVC classifier can easily mispredict and the result says little about
# real-world performance. This code is for demonstration purposes only.

Text: I do not like this product, Predicted label: positive


---

This example demonstrates a basic **text classification pipeline** using TF-IDF and a LinearSVC classifier.

We start by importing `TfidfVectorizer`, `train_test_split`, and `LinearSVC` from scikit-learn (NLTK is imported as well, although this example does not call it). A small **sample dataset** of text documents with corresponding positive or negative labels is defined.

We then prepare the data by creating a `TfidfVectorizer` instance and splitting the dataset into **training and testing sets** using `train_test_split`. TF-IDF features are calculated for the training set using `fit_transform`, and then *transformed* for the test set.

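Finally, a **LinearSVC classifier is trained** on the TF-IDF features of the training data. The trained classifier then **predicts labels for the test set**, and these predictions are printed. In the output above the single test review, "I do not like this product", is labeled positive even though its true label is negative, which is unsurprising with only three training examples. The example therefore shows the shape of a complete workflow rather than a usable model.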
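The same workflow can be written more compactly with a scikit-learn pipeline, which chains the vectorizer and the classifier so that `fit` only ever sees the training texts. This is a sketch of one possible arrangement using `make_pipeline`, not a replacement for the step-by-step version above; the data and split settings are copied from that example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

documents = [
    ("This is a positive review", "positive"),
    ("I do not like this product", "negative"),
    ("This movie is fantastic", "positive"),
    ("The service was terrible", "negative"),
]
texts = [text for text, _ in documents]
labels = [label for _, label in documents]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)

# The pipeline fits the vectorizer on the training texts and
# reuses the learned vocabulary and IDF weights at prediction time.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(X_train, y_train)

predictions = model.predict(X_test)
for text, predicted in zip(X_test, predictions):
    print(f"Text: {text}, Predicted label: {predicted}")
```

Because the vectorizer lives inside the pipeline, there is no separate `transform` call to forget, which removes one common source of test-set leakage.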

---

### Summary: Challenges and Considerations for TF-IDF

While **TF-IDF** is a widely used technique for text analysis, it comes with several challenges and considerations for researchers and practitioners. These often stem from the nature of the TF-IDF approach and the characteristics of the text data being analyzed.

#### Document Length Bias

One significant challenge is the **bias towards longer documents**. Raw term counts tend to grow with document length, so longer documents accumulate higher term frequencies, which can skew TF-IDF scores in their favor.

To counter this, **normalization techniques** are applied:

* **Simple Normalization:** Dividing the term's frequency by the total document length. For example, if "analysis" appears 10 times in a 100-word document (TF = 0.1) and 20 times in a 200-word document (TF = 0.1), normalization reveals the term has the same *proportional* importance, regardless of the document's size.
* **Sublinear Term Frequency Scaling:** Applying a logarithmic transformation to the term frequency (e.g., $1 + \log_{10}(\text{term frequency})$). This dampens the effect of very high counts: for "analysis", 10 occurrences $\rightarrow$ $1 + \log_{10}(10) = 2$ and 20 occurrences $\rightarrow$ $1 + \log_{10}(20) \approx 2.3$, so doubling the count raises the weight by only about 15%. (scikit-learn's `sublinear_tf=True` option applies the same idea with the natural logarithm.)

The choice of normalization method should align with the dataset's characteristics and analysis objectives.
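The arithmetic in the two bullet points can be checked with a few lines of plain Python. The function names below are illustrative, and the base-10 logarithm is chosen only to match the worked numbers above.

```python
import math

def length_normalized_tf(count, doc_length):
    # Term count divided by the total number of words in the document.
    return count / doc_length

def sublinear_tf(count, base=10):
    # 1 + log(count); a term that never occurs keeps a weight of zero.
    return 1 + math.log(count, base) if count > 0 else 0.0

# "analysis": 10 occurrences in 100 words vs. 20 occurrences in 200 words.
print(length_normalized_tf(10, 100), length_normalized_tf(20, 200))

# Sublinear scaling: doubling the count adds only log10(2) to the weight.
print(sublinear_tf(10), round(sublinear_tf(20), 3))
```

Both documents end up with the same length-normalized frequency, and under sublinear scaling the gap between 10 and 20 occurrences shrinks from a factor of 2 to roughly 1.15.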

#### Stopwords and Rare Terms

Another challenge is handling **stopwords** (common words like "the," "and," "is") which have high frequencies but low informational value. It's standard practice to **remove stopwords** to focus on more meaningful terms. Conversely, **rare terms** with very low frequencies might not provide significant insights and can introduce noise, so they are often filtered out.

#### Vocabulary Size

The **size of the vocabulary** (number of unique terms) in a document collection directly impacts the computational complexity and memory requirements for TF-IDF calculations. Large vocabularies can lead to increased processing time and memory usage. To mitigate this, **feature selection techniques** can be employed to select a subset of the most informative terms or to reduce the dimensionality of the TF-IDF matrix, improving efficiency.

---

When working with TF-IDF, the size of the vocabulary can lead to high-dimensional matrices, impacting computational efficiency. A common **feature selection technique** to address this is **Variance Threshold**.

This method aims to **remove terms with low variance** across documents in the TF-IDF matrix. The rationale is that terms whose TF-IDF scores show little variation between documents are not very useful for distinguishing one document from another and thus can be considered less informative.

For instance, if you have a TF-IDF matrix of 1000 terms across 100 documents, you would calculate the variance of each term's TF-IDF score across all documents. By setting a **variance threshold** (e.g., 0.01), any term with a variance below this limit is removed. If the term "analysis" has a variance of 0.005 (below 0.01), it would be eliminated, thereby **reducing the matrix's dimensionality** and focusing on terms that truly differentiate content.

While Variance Threshold is a straightforward way to improve TF-IDF efficiency, it ignores class labels. When labels are available, supervised alternatives such as `SelectKBest` with a chi-squared or mutual information scoring function can select the terms most associated with the target classes.

---

### TF-IDF Effectiveness in Diverse Scenarios

The effectiveness of **TF-IDF** as a text analysis technique varies significantly based on the characteristics of the data and the specific task at hand.

In **document classification**, TF-IDF is a widely used way to turn text into features. It does not look at class labels; instead, by down-weighting terms that occur in most documents, it gives relatively more weight to rarer terms, which are often the ones that discriminate between categories. For this reason TF-IDF features are a common baseline for text classification tasks such as **sentiment analysis, topic categorization, and spam detection**, although how much they help depends on the dataset and the classifier.

---