# TF-IDF Tutorial

This notebook explains the concept of TF-IDF (Term Frequency-Inverse Document Frequency) and shows how to calculate it both manually and using the `scikit-learn` library.

TF-IDF is a widely used technique in natural language processing (NLP) and information retrieval to evaluate the importance of a word in a document relative to a collection of documents (corpus).

**Why is TF-IDF useful?**

When we represent text as a Bag of Words, all words are treated equally. However, some words are more important than others for understanding the content of a document. For example, common words like "the", "a", or "is" appear very frequently but don't tell us much about the specific topic of a document. On the other hand, words that are unique to a particular document or a small set of documents are likely more important.

TF-IDF helps to address this by giving higher scores to words that are frequent in a document but rare in the overall collection of documents.

TF-IDF is calculated by multiplying two components:

1.  **Term Frequency (TF)**: How often a word appears in a single document.
2.  **Inverse Document Frequency (IDF)**: How rare or common a word is across the entire collection of documents.

Let's explore each of these components and then see how they are combined to calculate TF-IDF.

### Term Frequency (TF)

**Term Frequency (TF)** measures how frequently a term (word) appears in a document. Since a longer document might have a word appear more times than a shorter document, even if the word is not more important, TF is often normalized. A common way to normalize is to divide the number of times a word appears in a document by the total number of words in that document.

The formula for Term Frequency (TF) is:

$$ TF(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d} $$

The following function computes Term Frequency for a single word in a tokenized document.

In [5]:
import math

def compute_tf(word, document):
    """
    Computes the Term Frequency (TF) for a given word in a document.

    Args:
        word (str): The word to compute TF for.
        document (list): A list of words representing the document.

    Returns:
        float: The TF value for the word in the document.
    """
    return document.count(word) / len(document)

# Example usage:
doc1 = "I love NLP NLP".split()
doc2 = "NLP loves Python".split()
all_docs = [doc1, doc2]

print(f"TF for 'NLP' in doc1: {compute_tf('NLP', doc1):.3f}")
print(f"TF for 'loves' in doc2: {compute_tf('loves', doc2):.3f}")  # Matching is exact and case-sensitive: 'love' would not match 'loves'

TF for 'NLP' in doc1: 0.500
TF for 'loves' in doc2: 0.333
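
When TF is needed for every term in a document, calling `document.count` once per word rescans the list each time. A minimal sketch using `collections.Counter` computes all term frequencies in one pass; the helper name `compute_tf_all` is hypothetical and not part of the code above.

```python
from collections import Counter

def compute_tf_all(document):
    """Return a dict mapping each term in `document` to its term frequency."""
    if not document:
        return {}  # An empty document has no terms, so avoid dividing by zero
    counts = Counter(document)
    total = len(document)
    return {term: count / total for term, count in counts.items()}

doc1 = "I love NLP NLP".split()
tf_doc1 = compute_tf_all(doc1)
print(tf_doc1)  # {'I': 0.25, 'love': 0.25, 'NLP': 0.5}
```

The values agree with `compute_tf` for each word; only the number of passes over the document changes.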


### Inverse Document Frequency (IDF)

**Inverse Document Frequency (IDF)** measures how important a word is across the entire collection of documents (corpus). It downweights words that appear frequently in many documents and gives higher scores to words that are rare.

The idea is that words that appear in almost every document are likely not very informative for distinguishing between documents. Words that appear in only a few documents are more likely to be specific to those documents and therefore more important.

The formula for Inverse Document Frequency (IDF) is:

$$ IDF(t, D) = \log\left(\frac{\text{Total number of documents } |D|}{\text{Number of documents containing term } t}\right) $$

Where:
*   $|D|$ is the total number of documents in the corpus.
*   Number of documents containing term $t$ is the number of documents where the term $t$ appears at least once.

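The logarithm compresses the ratio so that very rare terms in a large corpus do not receive disproportionately large weights. A term that appears in every document has a ratio of 1 and therefore an IDF of $\log(1) = 0$. The code in this tutorial uses the natural logarithm.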
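
To get a feel for how much the logarithm compresses the ratio, the sketch below compares the raw ratio with its natural log for a few document frequencies. The corpus size of 1,000,000 is a made-up number chosen only for illustration.

```python
import math

total_docs = 1_000_000  # hypothetical corpus size, for illustration only

# Map each document frequency (df) to the natural log of total_docs / df
idf_by_df = {df: math.log(total_docs / df) for df in (1, 1_000, 1_000_000)}

for df, idf in idf_by_df.items():
    print(f"df={df:>9,}  ratio={total_docs / df:>11,.0f}  idf={idf:.3f}")
```

The raw ratio spans six orders of magnitude across these three cases, while the IDF ranges only from about 13.8 down to 0.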

The following function computes Inverse Document Frequency for a single word.

In [6]:
def compute_idf(word, documents):
    """
    Computes the Inverse Document Frequency (IDF) for a given word in a collection of documents.

    Args:
        word (str): The word to compute IDF for.
        documents (list of list): A list of documents, where each document is a list of words.

    Returns:
        float: The IDF value for the word across the documents.
    """
    # Count the number of documents containing the word
    num_docs_containing_word = sum(1 for doc in documents if word in doc)

    # This is the unsmoothed formula shown above, so a word that appears in no
    # document raises ZeroDivisionError. Callers are expected to pass a word that
    # occurs in at least one document. A smoothed variant would add 1 to the
    # numerator and denominator to avoid this.

    return math.log(len(documents) / num_docs_containing_word)

# Example usage (using the same documents as before):
# doc1 = "I love NLP NLP".split()
# doc2 = "NLP loves Python".split()
# all_docs = [doc1, doc2] # Already defined in the previous cell

print(f"IDF for 'NLP': {compute_idf('NLP', all_docs):.3f}") # Appears in both docs
print(f"IDF for 'love': {compute_idf('love', all_docs):.3f}") # Appears only in doc1
print(f"IDF for 'Python': {compute_idf('Python', all_docs):.3f}") # Appears only in doc2

IDF for 'NLP': 0.000
IDF for 'love': 0.693
IDF for 'Python': 0.693
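
The function above fails for a word that appears in no document. One common workaround is a smoothed IDF, $\ln\left(\frac{1 + |D|}{1 + \text{df}(t)}\right) + 1$, which is also the form scikit-learn documents for its default `smooth_idf=True` setting. The sketch below is one possible implementation; the name `compute_smooth_idf` is hypothetical.

```python
import math

def compute_smooth_idf(word, documents):
    """Smoothed IDF: ln((1 + N) / (1 + df)) + 1, defined even when df is 0."""
    num_docs_containing_word = sum(1 for doc in documents if word in doc)
    return math.log((1 + len(documents)) / (1 + num_docs_containing_word)) + 1

all_docs = ["I love NLP NLP".split(), "NLP loves Python".split()]

print(f"{compute_smooth_idf('NLP', all_docs):.3f}")   # 1.000 (in both documents)
print(f"{compute_smooth_idf('love', all_docs):.3f}")  # 1.405 (in one document)
print(f"{compute_smooth_idf('Java', all_docs):.3f}")  # 2.099 (in no document)
```

With this variant, a term that occurs in every document gets a weight of 1 rather than 0, so it is downweighted but not erased.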


### TF-IDF Calculation

The **TF-IDF** score for a word in a document is simply the product of its Term Frequency (TF) and its Inverse Document Frequency (IDF).

The formula for TF-IDF is:

$$ \text{TF-IDF}(t, d, D) = TF(t, d) \times IDF(t, D) $$

Where:
*   $TF(t, d)$ is the Term Frequency of term $t$ in document $d$.
*   $IDF(t, D)$ is the Inverse Document Frequency of term $t$ across the collection of documents $D$.

A high TF-IDF score means that the word is frequent in the specific document but rare in the overall collection of documents. This suggests that the word is likely important and relevant to the topic of that document.
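
As a worked check of the formula, take the term "love" in the first example document ("I love NLP NLP", four tokens) within the two-document corpus, where only that document contains "love". Using the natural logarithm:

$$ \text{TF-IDF}(\text{love}, d_1, D) = \frac{1}{4} \times \ln\left(\frac{2}{1}\right) \approx 0.25 \times 0.693 \approx 0.173 $$

By contrast, "NLP" appears in both documents, so its IDF is $\ln(2/2) = 0$ and its TF-IDF is 0 regardless of how often it occurs in the document.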

The following function calculates TF-IDF by combining the `compute_tf` and `compute_idf` functions we defined earlier.

In [7]:
def compute_tf_idf(word, document, documents):
    """
    Computes the TF-IDF score for a given word in a document within a collection of documents.

    Args:
        word (str): The word to compute TF-IDF for.
        document (list): A list of words representing the document.
        documents (list of list): A list of documents, where each document is a list of words.

    Returns:
        float: The TF-IDF value for the word in the document.
    """
    tf = compute_tf(word, document)
    idf = compute_idf(word, documents)
    return tf * idf

# Example documents (already defined in previous cells):
# doc1 = "I love NLP NLP".split()
# doc2 = "NLP loves Python".split()
# all_docs = [doc1, doc2]

# Compute TF-IDF for each unique word in each document
for doc_index, doc in enumerate(all_docs):
    print(f"Document {doc_index + 1}:")
    # set(doc) yields each unique word once; set iteration order is arbitrary, so the print order may vary
    for word in set(doc):
        tf_idf_value = compute_tf_idf(word, doc, all_docs)
        print(f"  {word}: {tf_idf_value:.3f}")

Document 1:
  I: 0.173
  NLP: 0.000
  love: 0.173
Document 2:
  Python: 0.231
  NLP: 0.000
  loves: 0.231


### Scikit-learn Bag of Words

Manually calculating TF-IDF can be tedious for large collections of documents. Libraries like `scikit-learn` provide efficient tools to do this automatically.

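First, let's look at how `scikit-learn` can be used to create a **Bag of Words** representation using the `CountVectorizer`. As we discussed earlier, Bag of Words represents text by the frequency of words, ignoring word order. By default, `CountVectorizer` lowercases the text and keeps only tokens of two or more alphanumeric characters, so "I" and the punctuation do not appear in the vocabulary below.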

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample sentences:
documents = [
    "I love NLP.",
    "NLP loves Python."
]

# Create BoW representation
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)

# The vectorizer learns the vocabulary and creates the Bag of Words matrix.
# Let's see the vocabulary it learned:
print("Vocabulary:", vectorizer.get_feature_names_out())

# And here is the Bag of Words matrix, showing word counts for each document:
print("BoW Matrix:\n", X.toarray())

Vocabulary: ['love' 'loves' 'nlp' 'python']
BoW Matrix:
 [[1 0 1 0]
 [0 1 1 1]]
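
The vocabulary above has no entry for "I". A rough plain-Python approximation of the vectorizer's default tokenization shows why. This sketch assumes the documented defaults (`lowercase=True` and `token_pattern=r"(?u)\b\w\w+\b"`); it is not the library's own code, and `tokenize_like_default` is a hypothetical helper.

```python
import re

# Assumption: mirrors CountVectorizer's documented default token pattern,
# which matches runs of two or more word characters.
DEFAULT_TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize_like_default(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return DEFAULT_TOKEN_PATTERN.findall(text.lower())

documents = ["I love NLP.", "NLP loves Python."]
tokens = [tokenize_like_default(doc) for doc in documents]
print(tokens)  # [['love', 'nlp'], ['nlp', 'loves', 'python']]

# The vocabulary is the sorted set of all tokens
vocabulary = sorted({token for doc_tokens in tokens for token in doc_tokens})
print(vocabulary)  # ['love', 'loves', 'nlp', 'python']
```

Single-character tokens such as "I" fail the two-character minimum, and the trailing periods are never part of a word-character run.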


### Scikit-learn TF-IDF

`scikit-learn` also provides a convenient class specifically for calculating TF-IDF: `TfidfVectorizer`. This vectorizer combines the steps of `CountVectorizer` (tokenizing and counting) and the TF-IDF calculation into a single process.

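It calculates a TF-IDF score for each term in each document within the corpus. Its defaults differ from the manual formulas above: it uses raw term counts for TF, a smoothed IDF of $\ln\left(\frac{1 + |D|}{1 + \text{df}(t)}\right) + 1$ (where $\text{df}(t)$ is the number of documents containing term $t$), and it L2-normalizes each document vector. As a result, a term that appears in every document (such as "nlp" here) still receives a non-zero weight, and the values will not match the manual results.

The cells below show how to use `TfidfVectorizer`.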

First, you need to import `TfidfVectorizer`:

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

Then, you initialize `TfidfVectorizer` and fit it to your documents, similar to how you used `CountVectorizer`.

In [10]:
# Sample sentences:
documents = [
    "I love NLP.",
    "NLP loves Python."
]

# Create TF-IDF representation
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(documents)

# The vectorizer learns the vocabulary and computes the TF-IDF matrix.
# You can see the vocabulary it learned (which is the same as with CountVectorizer for this example):
print("Vocabulary:", tfidf_vectorizer.get_feature_names_out())

# And here is the TF-IDF matrix:
# The values are the TF-IDF scores for each word in each document.
print("TF-IDF Matrix:\n", X_tfidf.toarray())

Vocabulary: ['love' 'loves' 'nlp' 'python']
TF-IDF Matrix:
 [[0.81480247 0.         0.57973867 0.        ]
 [0.         0.6316672  0.44943642 0.6316672 ]]
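
These numbers can be reproduced by hand, which helps explain why they differ from the manual TF-IDF values earlier. The sketch below assumes the documented `TfidfVectorizer` defaults (raw counts for TF, smoothed IDF, L2 normalization) and starts from the already tokenized, lowercased documents. It is an illustration in plain Python, not the library's implementation.

```python
import math

# Token lists as the default tokenizer would produce them (lowercased, "I" dropped)
docs = [["love", "nlp"], ["nlp", "loves", "python"]]
vocabulary = sorted({term for doc in docs for term in doc})
n_docs = len(docs)

def smooth_idf(term):
    """Smoothed IDF: ln((1 + N) / (1 + df)) + 1."""
    df = sum(1 for doc in docs if term in doc)
    return math.log((1 + n_docs) / (1 + df)) + 1

rows = []
for doc in docs:
    # Raw count times smoothed IDF for every vocabulary term
    weights = [doc.count(term) * smooth_idf(term) for term in vocabulary]
    # L2-normalize so each document vector has unit length
    norm = math.sqrt(sum(w * w for w in weights))
    rows.append([w / norm for w in weights])

for row in rows:
    print([round(value, 4) for value in row])
# [0.8148, 0.0, 0.5797, 0.0]
# [0.0, 0.6317, 0.4494, 0.6317]
```

The rounded rows match the matrix printed above, column for column in the order `love`, `loves`, `nlp`, `python`.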


### Computing Document Similarity using Cosine Similarity

Once we have the TF-IDF matrix representing our documents, we can compute the similarity between any two documents. A common and effective measure for this is **Cosine Similarity**.

Cosine similarity measures the cosine of the angle between two non-zero vectors. In the context of text analysis, the vectors are the TF-IDF vectors of the documents.

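*   If the angle between the vectors is 0 degrees, the cosine similarity is 1: the vectors point in the same direction, meaning the documents have the same relative term weights (even if one document is longer).
*   If the angle is 90 degrees, the cosine similarity is 0: the documents share no terms in the vocabulary.
*   Because TF-IDF weights are non-negative, the similarity always falls between 0 and 1, with higher values indicating greater similarity.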

The formula for cosine similarity between two vectors A and B is:

$$ \text{Cosine Similarity}(A, B) = \frac{A \cdot B}{||A|| \cdot ||B||} $$

Where:
*   $A \cdot B$ is the dot product of vectors A and B.
*   $||A||$ and $||B||$ are the magnitudes (or Euclidean norms) of vectors A and B.
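
The formula translates directly into a few lines of plain Python. The sketch below applies it to the two rows of the TF-IDF matrix printed earlier; `cosine_sim` is a hypothetical helper name, chosen to avoid clashing with any library function.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity of two equal-length, non-zero numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Rows of the TF-IDF matrix printed earlier
doc1_vec = [0.81480247, 0.0, 0.57973867, 0.0]
doc2_vec = [0.0, 0.6316672, 0.44943642, 0.6316672]

print(f"{cosine_sim(doc1_vec, doc2_vec):.3f}")  # 0.261
```

Only the shared term ("nlp", the third column) contributes to the dot product, which is why the similarity is well below 1.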

Fortunately, `scikit-learn` provides a convenient function to compute cosine similarity directly from the TF-IDF matrix.

The code below computes the cosine similarity between our two example documents using the TF-IDF matrix we generated.

In [11]:
from sklearn.metrics.pairwise import cosine_similarity

# We use the TF-IDF matrix X_tfidf that we computed in the previous step.
# Each row in X_tfidf represents a document.
# cosine_similarity accepts the sparse matrix directly, so no .toarray() conversion is needed.
# Called with a single matrix, it computes the similarity between all pairs of rows.
similarity_matrix = cosine_similarity(X_tfidf)

# The result is a matrix where entry (i, j) is the cosine similarity between document i and document j.
print("Cosine Similarity Matrix:")
print(similarity_matrix)

# To get the similarity between the first document (index 0) and the second document (index 1):
similarity_doc1_doc2 = similarity_matrix[0, 1]
print(f"\nCosine Similarity between Document 1 and Document 2: {similarity_doc1_doc2:.3f}")

# Note that the diagonal elements (i, i) are always 1, as a document is perfectly similar to itself.
# The matrix is also symmetric (similarity_matrix[i, j] is the same as similarity_matrix[j, i]).

Cosine Similarity Matrix:
[[1.         0.26055567]
 [0.26055567 1.        ]]

Cosine Similarity between Document 1 and Document 2: 0.261
