<a href="https://colab.research.google.com/github/RDGopal/IB9LQ0-GenAI/blob/main/tf_idf_illustration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# TF-IDF Illustration

In [None]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Example documents
documents = [
    "good product. I like it.",
    "product is easy to use. easy peasy.",
    "like how product works.",
    "not sure about the product use"
]

## Count Matrix

In [None]:
# Step 1: Count Matrix
# This matrix represents the raw count of terms in each document.
count_vectorizer = CountVectorizer()
count_matrix = count_vectorizer.fit_transform(documents).toarray()
terms = count_vectorizer.get_feature_names_out()
count_df = pd.DataFrame(count_matrix, index=[f"Doc {i+1}" for i in range(len(documents))], columns=terms)
print("Count Matrix:")
print(count_df.round(3))
print("\n")
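The count matrix depends on how `CountVectorizer` splits text into terms. With default settings it lowercases the input and keeps only tokens of two or more word characters, which is why "I" never appears as a column. The snippet below is a small check of that behavior on the first example document, not part of the original notebook flow.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the preprocessing + tokenizing function
# that CountVectorizer applies to each document.
analyzer = CountVectorizer().build_analyzer()

tokens = analyzer("good product. I like it.")
print(tokens)  # ['good', 'product', 'like', 'it']
```

The single-character token "I" is dropped, while the two-character "it" is kept.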


## Term Frequency Matrix

In [None]:
# Step 2: Term Frequency (TF) Matrix
# TF is calculated as term count divided by total terms in the document.
tf_matrix = count_matrix.astype(float)
for i in range(len(tf_matrix)):
    tf_matrix[i] /= tf_matrix[i].sum()
tf_df = pd.DataFrame(tf_matrix, index=[f"Doc {i+1}" for i in range(len(documents))], columns=terms)
print("Term Frequency (TF) Matrix:")
print(tf_df.round(3))
print("\n")
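The row-wise normalization can also be written without an explicit loop by using NumPy broadcasting. The sketch below uses a small made-up count matrix (`toy_counts` is hypothetical) and assumes every document contains at least one term, since an all-zero row would cause a division by zero.

```python
import numpy as np

# Hypothetical counts: 2 documents (rows) x 3 terms (columns).
toy_counts = np.array([[1, 0, 2],
                       [0, 3, 1]])

# keepdims=True keeps the row sums as a column vector of shape (2, 1),
# so each row is divided by its own total.
toy_tf = toy_counts / toy_counts.sum(axis=1, keepdims=True)

# Every row of a TF matrix should sum to 1.
print(toy_tf.sum(axis=1))
```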


## Inverse Document Frequency (IDF) Matrix

In [None]:
# Step 3: Inverse Document Frequency (IDF) Matrix
# Traditional IDF formula: IDF(t) = log(N / DF(t)), using the natural logarithm (np.log)
# where N = total number of documents, DF(t) = number of documents containing term t.
N = len(documents)
df_counts = np.count_nonzero(count_matrix, axis=0)  # Document frequency for each term
idf_values = np.log(N / df_counts)  # Applying the traditional formula
idf_df = pd.DataFrame([idf_values], index=["IDF"], columns=terms)
print("Inverse Document Frequency (IDF) Matrix:")
print(idf_df.round(3))
print("\n")

## TF-IDF Matrix

In [None]:
# Step 4: TF-IDF Matrix
# TF-IDF is computed as: TF-IDF = TF * IDF
tfidf_matrix = tf_matrix * idf_values
tfidf_df = pd.DataFrame(tfidf_matrix, index=[f"Doc {i+1}" for i in range(len(documents))], columns=terms)
print("TF-IDF Matrix:")
print(tfidf_df.round(3))
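As a sanity check on the matrix, one entry can be recomputed by hand. The snippet below works through "easy" in Doc 2, using the token list that `CountVectorizer` produces for that document with default settings, and also shows why "product" scores zero everywhere.

```python
import math

# Tokens of Doc 2 ("product is easy to use. easy peasy.") after default tokenization.
doc2_tokens = ["product", "is", "easy", "to", "use", "easy", "peasy"]

tf_easy = doc2_tokens.count("easy") / len(doc2_tokens)  # 2 / 7
idf_easy = math.log(4 / 1)  # N = 4 documents, "easy" appears in 1 of them
tfidf_easy = tf_easy * idf_easy
print(round(tfidf_easy, 3))  # 0.396

# "product" appears in all 4 documents, so its IDF (and TF-IDF) is 0.
idf_product = math.log(4 / 4)
print(idf_product)  # 0.0
```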


## Eliminating Terms

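### Process for Eliminating Terms Based on TF-IDF Values

When reducing the number of terms, the goal is to remove uninformative words while keeping the most relevant ones. This is done by scoring each term with an aggregate of its TF-IDF values across documents, then keeping only the terms whose score clears a threshold or ranks in the top N.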
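A threshold-based cut can be sketched as follows. The TF-IDF values, the `threshold` of 0.15, and the choice of the per-term maximum as the score are all illustrative assumptions, not values taken from this notebook's output.

```python
import pandas as pd

# Hypothetical TF-IDF matrix: 3 documents (rows) x 3 terms (columns).
toy_tfidf = pd.DataFrame(
    {
        "easy": [0.0, 0.396, 0.0],
        "product": [0.0, 0.0, 0.0],
        "like": [0.1, 0.0, 0.2],
    },
    index=["Doc 1", "Doc 2", "Doc 3"],
)

threshold = 0.15  # illustrative cutoff, to be tuned per dataset

# Score each term by its highest TF-IDF value in any document,
# then keep only the terms whose score reaches the threshold.
max_scores = toy_tfidf.max(axis=0)
kept_terms = max_scores[max_scores >= threshold].index.tolist()
filtered_tfidf = toy_tfidf[kept_terms]

print(kept_terms)  # ['easy', 'like']
```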

## Approaches for Selecting Important Terms

1. Maximum TF-IDF Scores Across All Documents

2. Sum of TF-IDF Scores Across All Documents

3. Average TF-IDF Score Across Documents

4. TF-IDF Variance Across Documents


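Pre-filtering common stopwords (e.g., "the", "and") keeps frequent but uninformative terms from skewing the ranking. Note that `CountVectorizer()` as used above does not remove stopwords unless its `stop_words` parameter is set.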

### Which Approach is Best?
* Use **max** TF-IDF when identifying unique keywords in a single document.
* Use **sum** TF-IDF when looking for the most important terms across the dataset.
* Use **average** TF-IDF when you want a balance between importance and spread across documents.
* Use **variance** TF-IDF when identifying words that are highly distinguishing for some documents but not all.
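The four scores can rank the same terms differently. The sketch below uses a made-up TF-IDF matrix with one term concentrated in a single document (`rare`) and one spread evenly (`spread`); the term names and values are hypothetical.

```python
import pandas as pd

# Hypothetical TF-IDF matrix: 3 documents (rows) x 2 terms (columns).
toy_tfidf = pd.DataFrame(
    {
        "rare": [0.9, 0.0, 0.0],    # high in one document only
        "spread": [0.4, 0.4, 0.4],  # moderate in every document
    },
    index=["Doc 1", "Doc 2", "Doc 3"],
)

# One row per term, one column per ranking method.
scores = pd.DataFrame(
    {
        "max": toy_tfidf.max(axis=0),
        "sum": toy_tfidf.sum(axis=0),
        "mean": toy_tfidf.mean(axis=0),
        "var": toy_tfidf.var(axis=0),  # pandas default: sample variance (ddof=1)
    }
)
print(scores.round(3))
```

Here max and variance favor `rare`, while sum and mean favor `spread`, which matches the guidance in the list above.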


In [None]:
# Step 5: Select Top N Terms Based on User Input
num_terms_to_keep = int(input("Enter the number of terms to keep: "))
print("Choose the ranking method:")
print("1: Maximum TF-IDF value")
print("2: Sum of TF-IDF values across documents")
print("3: Average TF-IDF value across documents")
print("4: Variance of TF-IDF values across documents")
selection_method = int(input("Enter the method number: "))

In [None]:
# Compute ranking scores
if selection_method == 1:
    ranking_scores = tfidf_df.max(axis=0)
elif selection_method == 2:
    ranking_scores = tfidf_df.sum(axis=0)
elif selection_method == 3:
    ranking_scores = tfidf_df.mean(axis=0)
elif selection_method == 4:
    ranking_scores = tfidf_df.var(axis=0)  # pandas default: sample variance (ddof=1)
else:
    print("Invalid selection, defaulting to max TF-IDF.")
    ranking_scores = tfidf_df.max(axis=0)

# Select top N terms
selected_terms = ranking_scores.nlargest(num_terms_to_keep).index  # Keep top N terms

# Filter matrices to keep only selected terms
filtered_count_df = count_df[selected_terms]
filtered_tf_df = tf_df[selected_terms]
filtered_idf_df = idf_df[selected_terms]
filtered_tfidf_df = tfidf_df[selected_terms]

print(f"Top {num_terms_to_keep} Terms Selected")
print("Filtered Count Matrix:")
print(filtered_count_df.round(3))
print("\n")
print("Filtered TF Matrix:")
print(filtered_tf_df.round(3))
print("\n")
print("Filtered IDF Matrix:")
print(filtered_idf_df.round(3))
print("\n")
print("Filtered TF-IDF Matrix:")
print(filtered_tfidf_df.round(3))
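For use outside an interactive session, the same selection logic can be wrapped in a function that takes the method as an argument instead of reading it from `input()`. The helper name `select_top_terms` and the toy data are hypothetical; this is a sketch of one possible packaging, and it raises an error on an unknown method rather than falling back to max.

```python
import pandas as pd

def select_top_terms(tfidf_df, n_terms, method="max"):
    """Hypothetical helper: rank terms (columns) by an aggregate, keep the top n_terms."""
    aggregators = {
        "max": tfidf_df.max,
        "sum": tfidf_df.sum,
        "mean": tfidf_df.mean,
        "var": tfidf_df.var,
    }
    if method not in aggregators:
        raise ValueError(f"Unknown method {method!r}; expected one of {sorted(aggregators)}")
    scores = aggregators[method](axis=0)
    return scores.nlargest(n_terms).index.tolist()

# Made-up TF-IDF values for three terms across three documents.
toy_tfidf = pd.DataFrame(
    {
        "rare": [0.9, 0.0, 0.0],
        "spread": [0.4, 0.4, 0.4],
        "weak": [0.1, 0.0, 0.1],
    },
    index=["Doc 1", "Doc 2", "Doc 3"],
)

top_by_max = select_top_terms(toy_tfidf, 2, method="max")
top_by_sum = select_top_terms(toy_tfidf, 2, method="sum")
print(top_by_max)  # ['rare', 'spread']
print(top_by_sum)  # ['spread', 'rare']
```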