# TF-IDF: Term Frequency - Inverse Document Frequency

## What is TF-IDF?

TF-IDF is a numerical statistic that reflects how important a word is to a document in a collection of documents.

**Formula:**
```
TF-IDF(term, document) = TF(term, document) × IDF(term)
```

Where:
- **TF (Term Frequency)**: How often a term appears in a document
- **IDF (Inverse Document Frequency)**: How rare/common a term is across all documents

**Key Insight:** Words that appear frequently in one document but rarely in others get high TF-IDF scores!

In [None]:
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

# Styling
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("✓ Libraries loaded successfully!")

---
## 1. Sample Documents

Let's use a simple corpus of documents about different topics.

In [None]:
# Sample corpus: 4 documents about different topics
documents = [
    "The cat sat on the mat. The cat was happy.",
    "The dog played in the garden. The dog loves to run.",
    "Machine learning is a subset of artificial intelligence. Machine learning algorithms learn from data.",
    "Python is a programming language. Python is popular for data science and machine learning."
]

doc_names = ['Doc 1: Cat', 'Doc 2: Dog', 'Doc 3: ML', 'Doc 4: Python']

# Display documents
print("📚 Our Document Corpus:\n")
for name, doc in zip(doc_names, documents):
    print(f"{name}:")
    print(f"  '{doc}'")
    print()

---
## 2. Step 1: Term Frequency (TF)

**Term Frequency** measures how often a term appears in a document.

Formula:
```
TF(term, doc) = (Number of times term appears in doc) / (Total number of terms in doc)
```

In [None]:
# Calculate Term Frequency for Document 1
doc1 = documents[0].lower()
# Note: split() breaks on whitespace only, so punctuation stays attached ('mat.', 'happy.')
words = doc1.split()

# Count word occurrences
word_counts = Counter(words)
total_words = len(words)

print(f"Document 1: '{documents[0]}'\n")
print(f"Total words: {total_words}")
print(f"\nWord counts:")
for word, count in word_counts.most_common():
    tf = count / total_words
    print(f"  '{word}': appears {count} times → TF = {count}/{total_words} = {tf:.3f}")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Raw counts
words_list = list(word_counts.keys())
counts_list = list(word_counts.values())
axes[0].barh(words_list, counts_list, color='steelblue', alpha=0.7)
axes[0].set_xlabel('Count', fontsize=12)
axes[0].set_title('Raw Word Counts (Document 1)', fontsize=14, fontweight='bold')
axes[0].invert_yaxis()

# Term Frequency
tf_list = [c / total_words for c in counts_list]
axes[1].barh(words_list, tf_list, color='coral', alpha=0.7)
axes[1].set_xlabel('Term Frequency', fontsize=12)
axes[1].set_title('Term Frequency (Normalized)', fontsize=14, fontweight='bold')
axes[1].invert_yaxis()

plt.tight_layout()
plt.show()

print("\n✓ TF normalizes word counts by document length!")
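
One caveat with the cell above: `str.split()` splits on whitespace only, so `'mat.'` is counted as a different token from `'mat'`. Here is a minimal sketch of a punctuation-aware tokenizer, assuming a simple regex over word characters is good enough for this toy corpus. The helper name `tokenize` is made up for illustration and is not part of the notebook.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return runs of word characters (hypothetical helper)."""
    return re.findall(r"\w+", text.lower())

doc = "The cat sat on the mat. The cat was happy."

naive_tokens = doc.lower().split()   # punctuation stays attached to words
clean_tokens = tokenize(doc)         # punctuation is dropped

print(naive_tokens)
print(clean_tokens)

# Term frequency computed on the cleaned tokens
counts = Counter(clean_tokens)
tf = {word: count / len(clean_tokens) for word, count in counts.items()}
print(tf)
```

Both versions produce 10 tokens for this sentence, so the TF values do not change here, but the cleaned tokens will match the same word in other documents.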

---
## 3. Step 2: Inverse Document Frequency (IDF)

**IDF** measures how rare or common a term is across all documents. The code in this notebook uses the natural logarithm (`np.log`).

Formula:
```
IDF(term) = log(Total number of documents / Number of documents containing term)
```

**Key:**
- Common words (appear in many documents) → Low IDF
- Rare words (appear in few documents) → High IDF
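
To make the formula concrete, here is a small worked example for a corpus of 4 documents. It assumes the natural logarithm, matching the `np.log` call used in this notebook. A different log base would rescale the numbers but keep the ordering.

```python
import math

n_documents = 4  # size of the toy corpus

# IDF for a term that appears in 1, 2, 3, or all 4 documents
idf_values = {
    doc_freq: math.log(n_documents / doc_freq)  # natural logarithm
    for doc_freq in range(1, n_documents + 1)
}

for doc_freq, idf in idf_values.items():
    print(f"appears in {doc_freq}/{n_documents} docs -> IDF = {idf:.3f}")
```

A term that appears in every document gets an IDF of exactly 0 under this formula, so its TF-IDF is 0 no matter how often it occurs.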

In [None]:
# Calculate IDF for all unique words
from collections import defaultdict

# Get all unique words and document frequency
doc_frequency = defaultdict(int)
all_words = set()

for doc in documents:
    words_in_doc = set(doc.lower().split())
    all_words.update(words_in_doc)
    for word in words_in_doc:
        doc_frequency[word] += 1

# Calculate IDF
n_documents = len(documents)
idf_scores = {}

for word in all_words:
    idf = np.log(n_documents / doc_frequency[word])
    idf_scores[word] = idf

# Sort by IDF score
sorted_idf = sorted(idf_scores.items(), key=lambda x: x[1], reverse=True)

# Display results
print(f"Total documents: {n_documents}\n")
print("IDF Scores (sorted by rarity):\n")
print(f"{'Word':<20} {'Appears in Docs':<20} {'IDF Score':<15}")
print("-" * 60)

for word, idf in sorted_idf[:15]:  # Show top 15
    doc_count = doc_frequency[word]
    print(f"{word:<20} {doc_count}/{n_documents} documents{' ':<8} {idf:.4f}")

# Visualize IDF scores
top_words = [w for w, _ in sorted_idf[:12]]
top_idfs = [idf_scores[w] for w in top_words]
colors = ['red' if doc_frequency[w] == 1 else 'orange' if doc_frequency[w] == 2 else 'green' 
          for w in top_words]

plt.figure(figsize=(12, 6))
bars = plt.bar(range(len(top_words)), top_idfs, color=colors, alpha=0.7)
plt.xticks(range(len(top_words)), top_words, rotation=45, ha='right')
plt.ylabel('IDF Score', fontsize=12)
plt.title('Inverse Document Frequency (IDF) Scores', fontsize=14, fontweight='bold')
plt.axhline(y=0, color='black', linestyle='-', linewidth=0.5)

# Add legend
from matplotlib.patches import Patch
legend_elements = [
    Patch(facecolor='red', alpha=0.7, label='Appears in 1 doc (rare)'),
    Patch(facecolor='orange', alpha=0.7, label='Appears in 2 docs'),
    Patch(facecolor='green', alpha=0.7, label='Appears in 3+ docs (common)')
]
plt.legend(handles=legend_elements, loc='upper right')
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

print("\n✓ Rare words get higher IDF scores!")
print("✓ Common words (like 'the') get lower IDF scores!")

---
## 4. Step 3: TF-IDF = TF × IDF

Now we combine Term Frequency and Inverse Document Frequency:

```
TF-IDF(term, doc) = TF(term, doc) × IDF(term)
```

This gives us a score that is:
- **High** when a term appears frequently in a document but rarely in others
- **Low** when a term appears in many documents (common words)
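
The two steps can be combined into one small function. This is a sketch under the same definitions used in this notebook (TF = count / document length, IDF = ln(N / df)). The function name `tf_idf` and the three-document corpus are hypothetical and exist only for this example.

```python
import math
from collections import Counter

def tf_idf(tokenized_docs):
    """Return one {term: score} dict per document, with TF = count/len and IDF = ln(N/df)."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    scores = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

# Hypothetical corpus, already tokenized
toy_docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "ran", "home"],
]

toy_scores = tf_idf(toy_docs)
for doc, doc_scores in zip(toy_docs, toy_scores):
    print(doc, {term: round(score, 3) for term, score in doc_scores.items()})
```

In the first document, "the" scores 0 because it appears everywhere, "cat" scores a little because it appears in two of three documents, and "sat" scores highest because it appears nowhere else.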

In [None]:
# Manual TF-IDF calculation for Document 1
doc1_words = documents[0].lower().split()
doc1_word_counts = Counter(doc1_words)
doc1_total_words = len(doc1_words)

print("📊 TF-IDF Calculation for Document 1:\n")
print(f"{'Word':<15} {'TF':<10} {'×':<5} {'IDF':<10} {'=':<5} {'TF-IDF':<10}")
print("-" * 65)

doc1_tfidf = {}
for word in set(doc1_words):
    tf = doc1_word_counts[word] / doc1_total_words
    idf = idf_scores[word]
    tfidf = tf * idf
    doc1_tfidf[word] = tfidf
    print(f"{word:<15} {tf:<10.4f} × {idf:<10.4f} = {tfidf:<10.4f}")

# Sort by TF-IDF score
sorted_tfidf = sorted(doc1_tfidf.items(), key=lambda x: x[1], reverse=True)

print("\n🎯 Words ranked by importance (TF-IDF) in Document 1:")
for i, (word, score) in enumerate(sorted_tfidf, 1):
    print(f"{i}. '{word}': {score:.4f}")

---
## 5. TF-IDF with scikit-learn

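In practice, we use scikit-learn's `TfidfVectorizer` for efficient computation. Its scores will not match the manual calculation above exactly: by default it uses raw counts for TF, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2-normalizes each document vector. Its default tokenizer also strips punctuation and drops single-character tokens such as "a".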

In [None]:
# Create TF-IDF vectorizer
vectorizer = TfidfVectorizer(lowercase=True, stop_words=None)

# Fit and transform documents
tfidf_matrix = vectorizer.fit_transform(documents)

# Get feature names (words)
feature_names = vectorizer.get_feature_names_out()

# Convert to DataFrame for easier viewing
tfidf_df = pd.DataFrame(
    tfidf_matrix.toarray(),
    columns=feature_names,
    index=doc_names
)

print("📊 TF-IDF Matrix (all documents):\n")
print(tfidf_df.round(3))
print(f"\nShape: {tfidf_df.shape} ({tfidf_df.shape[0]} documents × {len(feature_names)} unique words)")
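
To see where scikit-learn's numbers come from, here is a pure-Python sketch of `TfidfVectorizer`'s documented defaults: raw counts for TF, smoothed IDF, L2 normalization, and tokens of two or more word characters. It is an approximation for illustration, not the library's actual implementation, which works on sparse matrices. The function name `sklearn_style_tfidf` is made up for this example.

```python
import math
import re
from collections import Counter

def sklearn_style_tfidf(docs):
    """Sketch of TfidfVectorizer's defaults: raw counts, smoothed IDF, L2 norm."""
    # The default token pattern keeps tokens of two or more word characters
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for tokens in tokenized:
        raw = {t: count * idf[t] for t, count in Counter(tokens).items()}
        # L2 normalization; fall back to 1.0 so an empty document does not divide by zero
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
        rows.append({t: v / norm for t, v in raw.items()})
    return rows

sample_docs = [
    "The cat sat on the mat. The cat was happy.",
    "The dog played in the garden. The dog loves to run.",
]

rows = sklearn_style_tfidf(sample_docs)
for row in rows:
    print({t: round(v, 3) for t, v in sorted(row.items())})
```

Because the smoothed IDF never reaches 0, a word like "the" keeps a positive weight. With three occurrences it even outranks "cat" in the first document, which is one reason stop-word removal matters.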

---
## 6. Visualizing TF-IDF Scores

In [None]:
# Heatmap of TF-IDF scores
plt.figure(figsize=(16, 6))

# Select top words by max TF-IDF score
top_n = 15
max_scores = tfidf_df.max(axis=0)
top_words = max_scores.nlargest(top_n).index
tfidf_subset = tfidf_df[top_words]

sns.heatmap(tfidf_subset.T, annot=True, fmt='.3f', cmap='YlOrRd', 
            cbar_kws={'label': 'TF-IDF Score'}, linewidths=0.5)
plt.title(f'TF-IDF Heatmap: Top {top_n} Important Words Across Documents', 
          fontsize=14, fontweight='bold', pad=20)
plt.xlabel('Documents', fontsize=12)
plt.ylabel('Words', fontsize=12)
plt.tight_layout()
plt.show()

print("\n📈 Interpretation:")
print("  • Darker red = Higher TF-IDF score = More heavily weighted word")
print("  • Light yellow = Lower score = Common or less relevant word")
print("  • Topic-specific words (cat, dog, machine, python) score high, but with stop words kept,")
print("    a frequent word like 'the' can still score as high or higher in Docs 1 and 2")

In [None]:
# Top words per document
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.ravel()

for idx, doc_name in enumerate(doc_names):
    # Get top 8 words for this document
    doc_scores = tfidf_df.loc[doc_name].sort_values(ascending=False).head(8)
    
    # Filter out zero scores
    doc_scores = doc_scores[doc_scores > 0]
    
    # Plot
    colors = plt.cm.RdYlGn_r(doc_scores / doc_scores.max())
    axes[idx].barh(range(len(doc_scores)), doc_scores.values, color=colors)
    axes[idx].set_yticks(range(len(doc_scores)))
    axes[idx].set_yticklabels(doc_scores.index)
    axes[idx].set_xlabel('TF-IDF Score', fontsize=10)
    axes[idx].set_title(f'{doc_name}', fontsize=12, fontweight='bold')
    axes[idx].invert_yaxis()
    axes[idx].grid(axis='x', alpha=0.3)

plt.suptitle('Most Important Words per Document (TF-IDF)', 
             fontsize=16, fontweight='bold', y=1.00)
plt.tight_layout()
plt.show()

---
## 7. Comparison: With vs. Without Stop Words

Stop words are common words (like "the", "is", "a") that usually don't carry much meaning. Let's see the effect of removing them.
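
Conceptually, stop-word removal is just a filter applied to the tokens before counting. Here is a minimal sketch that assumes a tiny hand-picked stop list. The names `STOP_WORDS` and `remove_stop_words` are hypothetical, and scikit-learn's built-in `'english'` list is much longer.

```python
# A tiny, hand-picked stop list for illustration only
STOP_WORDS = {"the", "is", "a", "on", "was", "in", "to", "of", "for", "and", "from"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop list."""
    return [token for token in tokens if token not in stop_words]

tokens = ["the", "cat", "sat", "on", "the", "mat", "the", "cat", "was", "happy"]
content_tokens = remove_stop_words(tokens)
print(content_tokens)  # ['cat', 'sat', 'mat', 'cat', 'happy']
```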

In [None]:
# TF-IDF without stop words
vectorizer_no_stop = TfidfVectorizer(lowercase=True, stop_words='english')
tfidf_matrix_no_stop = vectorizer_no_stop.fit_transform(documents)

feature_names_no_stop = vectorizer_no_stop.get_feature_names_out()
tfidf_df_no_stop = pd.DataFrame(
    tfidf_matrix_no_stop.toarray(),
    columns=feature_names_no_stop,
    index=doc_names
)

# Comparison
print("📊 Comparison: With vs. Without Stop Words\n")
print(f"With stop words:    {len(feature_names)} unique words")
print(f"Without stop words: {len(feature_names_no_stop)} unique words")
print(f"Reduction:          {len(feature_names) - len(feature_names_no_stop)} words removed\n")

# Show removed words (stop words)
stop_words_removed = set(feature_names) - set(feature_names_no_stop)
print(f"Stop words removed: {sorted(stop_words_removed)}")

# Visualize comparison for Document 3 (ML document)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# With stop words
doc3_with = tfidf_df.loc['Doc 3: ML'].sort_values(ascending=False).head(10)
doc3_with = doc3_with[doc3_with > 0]
axes[0].barh(range(len(doc3_with)), doc3_with.values, color='steelblue', alpha=0.7)
axes[0].set_yticks(range(len(doc3_with)))
axes[0].set_yticklabels(doc3_with.index)
axes[0].set_xlabel('TF-IDF Score')
axes[0].set_title('With Stop Words', fontsize=12, fontweight='bold')
axes[0].invert_yaxis()

# Without stop words
doc3_without = tfidf_df_no_stop.loc['Doc 3: ML'].sort_values(ascending=False).head(10)
doc3_without = doc3_without[doc3_without > 0]
axes[1].barh(range(len(doc3_without)), doc3_without.values, color='coral', alpha=0.7)
axes[1].set_yticks(range(len(doc3_without)))
axes[1].set_yticklabels(doc3_without.index)
axes[1].set_xlabel('TF-IDF Score')
axes[1].set_title('Without Stop Words', fontsize=12, fontweight='bold')
axes[1].invert_yaxis()

plt.suptitle('Top Words in ML Document (Doc 3)', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("\n✓ Removing stop words helps focus on meaningful content words!")

---
## 8. Real-World Application: Document Similarity

TF-IDF is often used to compute document similarity. Similar documents will have similar TF-IDF vectors.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Calculate cosine similarity between all documents
similarity_matrix = cosine_similarity(tfidf_matrix_no_stop)

# Create DataFrame
similarity_df = pd.DataFrame(
    similarity_matrix,
    index=doc_names,
    columns=doc_names
)

print("📊 Document Similarity Matrix (Cosine Similarity):\n")
print(similarity_df.round(3))
print("\n1.0 = same TF-IDF direction (e.g. identical documents), 0.0 = no terms in common")

# Visualize
plt.figure(figsize=(8, 6))
mask = np.triu(np.ones_like(similarity_matrix, dtype=bool), k=1)
sns.heatmap(similarity_df, annot=True, fmt='.3f', cmap='coolwarm', 
            center=0.5, vmin=0, vmax=1, mask=mask,
            square=True, linewidths=1, cbar_kws={'label': 'Similarity'})
plt.title('Document Similarity Based on TF-IDF', fontsize=14, fontweight='bold', pad=15)
plt.tight_layout()
plt.show()

print("\n🔍 Insights:")
print("  • Doc 3 (ML) and Doc 4 (Python) are most similar - they share 'machine', 'learning', and 'data'")
print("  • Doc 1 (Cat) and Doc 2 (Dog) share no non-stop-word terms, so their similarity is 0.0")
print("  • TF-IDF only matches exact terms; it cannot tell that cats and dogs are both animals")
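
For intuition about what `cosine_similarity` computes, here is a pure-Python sketch over sparse vectors stored as dictionaries. The weights are made up for illustration and are not taken from the matrices above. The helper name `cosine_similarity_dicts` is hypothetical.

```python
import math

def cosine_similarity_dicts(vec_a, vec_b):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_a * norm_b)

# Hypothetical TF-IDF weights
ml_doc = {"machine": 0.6, "learning": 0.6, "data": 0.3}
py_doc = {"python": 0.7, "machine": 0.3, "learning": 0.3, "data": 0.3}
cat_doc = {"cat": 0.8, "mat": 0.4}

sim_ml_py = cosine_similarity_dicts(ml_doc, py_doc)
sim_ml_cat = cosine_similarity_dicts(ml_doc, cat_doc)

print(f"ML vs Python: {sim_ml_py:.3f}")
print(f"ML vs Cat:    {sim_ml_cat:.3f}")
```

Only shared terms contribute to the dot product, so two documents with no words in common always score 0, however related their topics are.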

---
## 9. Summary: When to Use TF-IDF

### ✅ Use TF-IDF for:
- **Text classification** - converting text to numerical features
- **Information retrieval** - finding relevant documents
- **Document clustering** - grouping similar documents
- **Keyword extraction** - identifying important terms
- **Document comparison** - measuring similarity

### 🔑 Key Takeaways:
1. **TF-IDF highlights distinctive words** - words that are frequent in a document but rare across the corpus
2. **Common words get low scores** - words like "the", "is", "a" are downweighted
3. **Better than raw counts** - considers both local (TF) and global (IDF) importance
4. **Foundation for NLP** - widely used in search engines, recommendation systems, and text analysis

### ⚠️ Limitations:
- Ignores word order and context
- Doesn't capture semantic meaning
- For advanced tasks, consider: Word2Vec, BERT, or other embedding methods
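
The first limitation is easy to demonstrate. In this small sketch, two sentences with opposite meanings produce exactly the same bag of words, so any count-based representation, TF-IDF included, cannot tell them apart.

```python
from collections import Counter

sentence_a = "dog bites man"
sentence_b = "man bites dog"

bag_a = Counter(sentence_a.split())
bag_b = Counter(sentence_b.split())

print(bag_a == bag_b)  # True: same words, same counts, different meaning
```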

In [None]:
# Final visualization: TF-IDF concept summary
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# TF example
axes[0].bar(['Word A', 'Word B', 'Word C'], [5, 2, 1], color='skyblue', alpha=0.7)
axes[0].set_title('Term Frequency (TF)\nHow often in THIS document?', 
                  fontsize=11, fontweight='bold')
axes[0].set_ylabel('Count in Document')
axes[0].grid(axis='y', alpha=0.3)

# IDF example
axes[1].bar(['Word A\n(appears in\n3/4 docs)', 
             'Word B\n(appears in\n2/4 docs)', 
             'Word C\n(appears in\n1/4 docs)'], 
            [0.3, 0.7, 1.4], color='orange', alpha=0.7)
axes[1].set_title('Inverse Document Frequency (IDF)\nHow rare across ALL documents?', 
                  fontsize=11, fontweight='bold')
axes[1].set_ylabel('IDF Score')
axes[1].grid(axis='y', alpha=0.3)

# TF-IDF example
tfidf_values = [5*0.3, 2*0.7, 1*1.4]
colors_final = ['lightcoral', 'gold', 'lightgreen']
bars = axes[2].bar(['Word A', 'Word B', 'Word C'], tfidf_values, 
                   color=colors_final, alpha=0.7)
axes[2].set_title('TF-IDF = TF × IDF\nFinal importance score',
                  fontsize=11, fontweight='bold')
axes[2].set_ylabel('TF-IDF Score')
axes[2].grid(axis='y', alpha=0.3)

# Add value labels
for bar, val in zip(bars, tfidf_values):
    height = bar.get_height()
    axes[2].text(bar.get_x() + bar.get_width()/2., height,
                f'{val:.2f}', ha='center', va='bottom', fontweight='bold')

plt.suptitle('Understanding TF-IDF: The Complete Picture', 
             fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

print("\n🎯 Remember: TF-IDF = Term Frequency × Inverse Document Frequency")
print("   → Highlights words that are important to a specific document!")