# Implementing TF-IDF in Python with scikit-learn

Implementing **TF-IDF in Python** usually takes two steps. First, **preprocess your text documents**: tokenize them and, optionally, remove stopwords and stem the tokens. Then compute the TF-IDF scores with `TfidfVectorizer` from `sklearn.feature_extraction.text`. This class lowercases and tokenizes its input by default, so the example below passes raw strings straight to it. The result is a matrix of **TF-IDF feature vectors**, ready for downstream tasks such as classification or clustering.
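
The preprocessing steps can be sketched with the standard library alone. This is a minimal illustration, not a production pipeline: the `preprocess` helper, the tiny `STOPWORDS` set, and the optional `stem` argument are assumptions made for this example. The token pattern is the one `TfidfVectorizer` uses by default.

```python
import re

# Illustrative stopword list (an assumption for this sketch); real projects
# usually rely on a fuller list such as the ones shipped with nltk or scikit-learn.
STOPWORDS = {"i", "to", "is", "my", "with", "another"}

# Default TfidfVectorizer token pattern: runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def preprocess(text, stem=lambda token: token):
    """Lowercase, tokenize, drop stopwords, then apply an optional stemmer."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [stem(token) for token in tokens if token not in STOPWORDS]


print(preprocess("I enjoy playing soccer with my friends"))
# ['enjoy', 'playing', 'soccer', 'friends']
```

To add stemming, a callable such as `nltk.stem.PorterStemmer().stem` could be passed as `stem`. The resulting tokens can be joined back into strings before vectorizing.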

---

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer # Converts raw text into TF-IDF features.

# Sample documents for demonstration.
sample_documents = [
    "I love to play soccer",
    "Soccer is my favorite sport",
    "I enjoy playing soccer with my friends",
    "Football is another popular sport",
    "I don't like basketball"
]

# Create the TF-IDF vectorizer with its default settings.
tfidf_vectorizer = TfidfVectorizer()

# 'fit_transform' learns the vocabulary and IDF values from 'sample_documents',
# then returns a sparse matrix with one row per document and one column per term.
tfidf_scores_matrix = tfidf_vectorizer.fit_transform(sample_documents)

# Get the names of the features (terms) from the vectorizer's learned vocabulary.
# These correspond to the columns in the TF-IDF matrix.
feature_names = tfidf_vectorizer.get_feature_names_out()

# Print the non-zero TF-IDF scores for each document.
# 'toarray' converts the sparse matrix to a dense one so each row can be zipped with the terms.
for doc, row in zip(sample_documents, tfidf_scores_matrix.toarray()):
    print("Document:", doc)
    for term, score in zip(feature_names, row):
        # Skip terms that do not occur in this document.
        if score > 0:
            print(f"  {term}: {score:.4f}") # Four decimal places for readability.
    print() # Blank line between documents.

Document: I love to play soccer
  love: 0.5385
  play: 0.5385
  soccer: 0.3606
  to: 0.5385

Document: Soccer is my favorite sport
  favorite: 0.5422
  is: 0.4375
  my: 0.4375
  soccer: 0.3631
  sport: 0.4375

Document: I enjoy playing soccer with my friends
  enjoy: 0.4428
  friends: 0.4428
  my: 0.3573
  playing: 0.4428
  soccer: 0.2966
  with: 0.4428

Document: Football is another popular sport
  another: 0.4821
  football: 0.4821
  is: 0.3890
  popular: 0.4821
  sport: 0.3890

Document: I don't like basketball
  basketball: 0.5774
  don: 0.5774
  like: 0.5774
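
The scores above can be reproduced by hand, which also explains two details of the output: `I` never appears, and `don't` shows up as `don`. The sketch below assumes scikit-learn's default settings: lowercasing, tokens of two or more word characters, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each document vector. The helper name `tfidf` is introduced here for illustration only.

```python
import math
import re

docs = [
    "I love to play soccer",
    "Soccer is my favorite sport",
    "I enjoy playing soccer with my friends",
    "Football is another popular sport",
    "I don't like basketball",
]

# Default tokenization: lowercase, keep runs of 2+ word characters.
# Single-character tokens such as "I" and the "t" in "don't" are dropped.
tokenized = [re.findall(r"(?u)\b\w\w+\b", doc.lower()) for doc in docs]

n_docs = len(docs)
vocab = sorted({token for tokens in tokenized for token in tokens})

# Document frequency: number of documents containing each term.
df = {term: sum(term in tokens for tokens in tokenized) for term in vocab}

# Smoothed inverse document frequency.
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}


def tfidf(tokens):
    """Return L2-normalized TF-IDF scores for one tokenized document."""
    raw = {term: tokens.count(term) * idf[term] for term in set(tokens)}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {term: value / norm for term, value in sorted(raw.items())}


for term, score in tfidf(tokenized[0]).items():
    print(f"{term}: {score:.4f}")
# love: 0.5385
# play: 0.5385
# soccer: 0.3606
# to: 0.5385
```

`soccer` scores lowest in the first document because it occurs in three of the five documents, so its IDF is smaller than that of terms found in only one.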

