# Implementing TF-IDF in Python with NLTK and scikit-learn

Implementing **TF-IDF in Python** usually takes two steps. First, **preprocess your text documents**: tokenize them, remove stopwords, and optionally stem the remaining words. Second, compute the TF-IDF scores with `TfidfVectorizer` from `sklearn.feature_extraction.text`. This class turns your documents into **TF-IDF feature vectors** that can feed later text analysis tasks such as classification or clustering. By default, `TfidfVectorizer` lowercases and tokenizes the text itself but does not remove stopwords or stem, so the first example below passes raw sentences straight to it.
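
The preprocessing steps named above could look roughly like the following. This is a minimal sketch rather than part of the original notebook: the helper name `preprocess`, the simple regex tokenizer, and the pairing of scikit-learn's built-in `ENGLISH_STOP_WORDS` with NLTK's `PorterStemmer` are all assumptions made for illustration.

```python
import re

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

stemmer = PorterStemmer()


def preprocess(text):
    """Hypothetical helper: lowercase, tokenize, drop stopwords, stem."""
    # Tokenize: keep runs of ASCII letters only (an assumed, simplistic rule).
    tokens = re.findall(r"[a-z]+", text.lower())
    # Stopword removal using scikit-learn's built-in English list.
    kept = [token for token in tokens if token not in ENGLISH_STOP_WORDS]
    # Stemming: reduce each remaining word to its Porter stem.
    return " ".join(stemmer.stem(token) for token in kept)


print(preprocess("Playing with friends"))
```

The returned string can be passed to `TfidfVectorizer` like any other document. Whether stemming helps depends on the task, since it merges related word forms but also produces tokens that are harder to read.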

---

```python
from sklearn.feature_extraction.text import TfidfVectorizer  # Converts text to TF-IDF features.

# Sample documents for demonstration.
sample_documents = [
    "I love to play soccer",
    "Soccer is my favorite sport",
    "I enjoy playing soccer with my friends",
    "Football is another popular sport",
    "I don't like basketball"
]

# Create the TF-IDF vectorizer object.
# This object will learn the vocabulary and IDF values from the documents,
# and then transform the documents into TF-IDF numerical representations.
tfidf_vectorizer = TfidfVectorizer()

# Compute the TF-IDF scores for the sample documents.
# 'fit_transform' first learns the vocabulary and IDF values from 'sample_documents',
# then transforms these documents into a sparse matrix of TF-IDF scores.
tfidf_scores_matrix = tfidf_vectorizer.fit_transform(sample_documents)

# Get the names of the features (terms) from the vectorizer's learned vocabulary.
# These correspond to the columns in the TF-IDF matrix.
feature_names = tfidf_vectorizer.get_feature_names_out()

# Print the TF-IDF scores for each document.
for i, doc in enumerate(sample_documents):
    print("Document:", doc)  # Print the original document text.
    for j, term in enumerate(feature_names):
        # Access the TF-IDF score for the current document (i) and current term (j).
        score = tfidf_scores_matrix[i, j]
        # Only print terms that have a non-zero TF-IDF score in the current document.
        if score > 0:
            print(f"  {term}: {score:.4f}")  # Format score to 4 decimal places.
    print()  # Blank line between documents.
```

```text
Document: I love to play soccer
  love: 0.5385
  play: 0.5385
  soccer: 0.3606
  to: 0.5385

Document: Soccer is my favorite sport
  favorite: 0.5422
  is: 0.4375
  my: 0.4375
  soccer: 0.3631
  sport: 0.4375

Document: I enjoy playing soccer with my friends
  enjoy: 0.4428
  friends: 0.4428
  my: 0.3573
  playing: 0.4428
  soccer: 0.2966
  with: 0.4428

Document: Football is another popular sport
  another: 0.4821
  football: 0.4821
  is: 0.3890
  popular: 0.4821
  sport: 0.3890

Document: I don't like basketball
  basketball: 0.5774
  don: 0.5774
  like: 0.5774
```

---
**TF-IDF scores** measure how important a term is to one document relative to the rest of the corpus. Reading them correctly matters for most text mining and information retrieval tasks.

**High TF-IDF scores** indicate a term is **frequent in a specific document** but **relatively rare across the entire corpus**. This suggests the term is highly **distinctive and significant** to that particular document's content.

Conversely, **low TF-IDF scores** mean a term is **infrequent in a document** or **very common throughout the corpus**. These terms are typically less informative and say little about what makes a specific document unique (e.g., common words like "the," "and," or "is"). In the first example, `soccer` has the lowest score in every document that contains it, because three of the five documents use the word.
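
To make the high/low distinction concrete, the scores can be recomputed by hand. The sketch below assumes scikit-learn's documented defaults for `TfidfVectorizer`: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 normalization of each document vector, and tokens of two or more word characters. The function names `tokenize` and `tfidf_by_hand` are hypothetical.

```python
import math
import re
from collections import Counter

sample_documents = [
    "I love to play soccer",
    "Soccer is my favorite sport",
    "I enjoy playing soccer with my friends",
    "Football is another popular sport",
    "I don't like basketball",
]


def tokenize(text):
    # Assumed to mirror the vectorizer's default: lowercase, tokens of 2+ word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


def tfidf_by_hand(documents):
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)  # Raw term counts in this document.
        raw = {term: count * idf[term] for term, count in tf.items()}
        # L2-normalize so each document vector has unit length.
        norm = math.sqrt(sum(value * value for value in raw.values())) or 1.0
        rows.append({term: value / norm for term, value in raw.items()})
    return rows


scores = tfidf_by_hand(sample_documents)
print(f"love:   {scores[0]['love']:.4f}")
print(f"soccer: {scores[0]['soccer']:.4f}")
```

In the first document, `love` appears in one document out of five and `soccer` in three, so `love` gets the larger IDF and therefore the larger normalized score, which is consistent with the 0.5385 and 0.3606 printed in the output above.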

Interpreting TF-IDF also involves **comparing scores** across terms and documents. Within a single document, the highest-scoring terms are the ones that best distinguish it from the rest of the corpus. Across documents, comparing the scores of one term shows which documents it characterizes most strongly. These comparisons underpin tasks like document clustering, topic modeling, and information retrieval.

---

### TF-IDF: Information Retrieval's Core

**TF-IDF** is a fundamental technique in **information retrieval (IR)**, where it is used for **ranking and retrieving relevant documents** in response to user queries. Search systems have long used TF-IDF-style term weights to match query terms against document content, so that documents built around the query's distinctive terms outrank those that merely mention common words.

Its primary applications within IR include:

* **Document Ranking:** TF-IDF helps estimate how relevant a document is to a query. Documents with higher TF-IDF scores for the query's terms are ranked higher in the search results.
* **Keyword Extraction:** Terms with high TF-IDF scores in a document are good candidates for its **key terms or phrases**. These distinctive words are useful for document indexing, categorization, and topic labeling.

Both uses rest on the same property: TF-IDF rewards terms that are frequent in one document and rare in the rest of the corpus.
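
Document ranking can be sketched as follows. Note the assumption: instead of summing raw TF-IDF scores for the query terms, this sketch uses cosine similarity between the query vector and each document vector, which is one common way to turn TF-IDF weights into a ranking. The helper name `rank_documents` and the example query are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sample_documents = [
    "I love to play soccer",
    "Soccer is my favorite sport",
    "I enjoy playing soccer with my friends",
    "Football is another popular sport",
    "I don't like basketball",
]


def rank_documents(query, documents):
    """Hypothetical helper: return (document index, similarity), best match first."""
    vectorizer = TfidfVectorizer()
    # Learn vocabulary and IDF weights from the documents only.
    doc_matrix = vectorizer.fit_transform(documents)
    # Project the query into the same vector space; unseen words are ignored.
    query_vector = vectorizer.transform([query])
    similarities = cosine_similarity(query_vector, doc_matrix).ravel()
    order = similarities.argsort()[::-1]
    return [(int(i), float(similarities[i])) for i in order]


ranking = rank_documents("popular sport", sample_documents)
for index, similarity in ranking:
    print(f"{similarity:.4f}  {sample_documents[index]}")
```

The document that contains both query terms should rank above the one that contains only `sport`, and documents sharing no terms with the query get a similarity of zero.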

---

```python
import nltk  # Natural Language Toolkit, used here for tokenization.
from sklearn.feature_extraction.text import TfidfVectorizer  # Converts text to TF-IDF features.

# 'word_tokenize' needs the Punkt tokenizer data. Depending on the NLTK version,
# the resource is named 'punkt' or 'punkt_tab', so both are requested here.
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)

# Sample documents for demonstration.
documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third.",
    "This is the first document?"
]

# Preprocess the documents.
# Each document is tokenized (split into words) after being converted to lowercase.
processed_corpus = [nltk.word_tokenize(document.lower()) for document in documents]
# Convert the preprocessed documents (list of tokens) back into strings, joined by spaces.
processed_corpus = [' '.join(doc_tokens) for doc_tokens in processed_corpus]

# Create the TF-IDF vectorizer.
# Its default tokenizer drops punctuation and single-character tokens.
tfidf_vectorizer = TfidfVectorizer()

# Compute the TF-IDF scores.
# 'fit_transform' learns the vocabulary and IDF values from the 'processed_corpus',
# then transforms these documents into a sparse matrix of TF-IDF scores.
tfidf_scores_matrix = tfidf_vectorizer.fit_transform(processed_corpus)

# Get the names of the features (words) from the vectorizer's learned vocabulary.
# These correspond to the columns in the TF-IDF matrix.
feature_names = tfidf_vectorizer.get_feature_names_out()

# Print the TF-IDF scores for each document.
# '.toarray()' converts the whole sparse matrix to a dense NumPy array,
# so iterating over it yields one row of scores per document.
for doc_index, doc_scores in enumerate(tfidf_scores_matrix.toarray()):
    print(f"Document {doc_index + 1}:")  # Print the document number.
    for word_index, score in enumerate(doc_scores):
        # Only print terms that have a non-zero TF-IDF score in the current document.
        if score > 0:
            # Print the word and its TF-IDF score, formatted to 4 decimal places.
            print(f"\tWord: {feature_names[word_index]}, TF-IDF Score: {score:.4f}")
    print()  # Blank line between documents.
```

```text
Document 1:
	Word: document, TF-IDF Score: 0.4698
	Word: first, TF-IDF Score: 0.5803
	Word: is, TF-IDF Score: 0.3841
	Word: the, TF-IDF Score: 0.3841
	Word: this, TF-IDF Score: 0.3841

Document 2:
	Word: document, TF-IDF Score: 0.6876
	Word: is, TF-IDF Score: 0.2811
	Word: second, TF-IDF Score: 0.5386
	Word: the, TF-IDF Score: 0.2811
	Word: this, TF-IDF Score: 0.2811

Document 3:
	Word: and, TF-IDF Score: 0.5958
	Word: is, TF-IDF Score: 0.3109
	Word: the, TF-IDF Score: 0.3109
	Word: third, TF-IDF Score: 0.5958
	Word: this, TF-IDF Score: 0.3109

Document 4:
	Word: document, TF-IDF Score: 0.4698
	Word: first, TF-IDF Score: 0.5803
	Word: is, TF-IDF Score: 0.3841
	Word: the, TF-IDF Score: 0.3841
	Word: this, TF-IDF Score: 0.3841
```

