# 📌 Topic: TF-IDF Vectorization

### What you will learn
- What TF-IDF is and how it improves upon the Bag of Words model
- The mathematical concepts behind Term Frequency (TF) and Inverse Document Frequency (IDF)
- Practical implementation using Scikit-learn's `TfidfVectorizer`
- How to interpret the importance of words in a corpus

### Why this matters
In simple frequency counts (BoW), common words like "the", "is", and "and" often dominate the feature set but carry very little informative value. **TF-IDF** (Term Frequency-Inverse Document Frequency) solves this by penalizing common words and boosting words that are unique to specific documents. It is a cornerstone of search engines and information retrieval systems.

---

## The Math of Importance

TF-IDF is the product of two statistics:

1.  **Term Frequency (TF)**: How often a word appears in a single document. (Rewards words that are frequent in that document.)
2.  **Inverse Document Frequency (IDF)**: How rare a word is across the *entire* collection of documents. (Penalizes words that appear everywhere.)

**Formula**: `TF-IDF = TF(word, document) * IDF(word, corpus)`

A high score means the word is frequent in the current document but rare in others, making it a strong "signature" word for that document.
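
To make the product concrete, here is a minimal sketch that computes TF-IDF by hand using the textbook definitions (TF as count divided by document length, IDF as `log(N / df)`). The three tokenized documents and the helper names `tf`, `idf`, and `tf_idf` are made up for illustration. Scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each row by default, so its numbers will differ from these.

```python
import math

# Hypothetical three-document corpus, already tokenized and lowercased.
docs = [
    ["data", "science", "uses", "data"],
    ["machine", "learning", "uses", "data"],
    ["linguistics", "studies", "language"],
]


def tf(term, doc):
    """Count of `term` in `doc`, divided by the document length."""
    return doc.count(term) / len(doc)


def idf(term, docs):
    """Textbook IDF: log(number of documents / documents containing term).

    Assumes `term` occurs in at least one document.
    """
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)


def tf_idf(term, doc, docs):
    """Product of the two statistics for one term in one document."""
    return tf(term, doc) * idf(term, docs)


# "data" occurs twice in document 0 but also appears in document 1.
score_data = tf_idf("data", docs[0], docs)
# "science" occurs once, and only in document 0.
score_science = tf_idf("science", docs[0], docs)

print(f"data:    {score_data:.3f}")
print(f"science: {score_science:.3f}")
```

In this sketch "science" outscores "data" in the first document even though "data" occurs twice, because "data" also shows up in a second document and so has a lower IDF.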

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample toy corpus to illustrate weighting
corpus = [
    "Natural language processing is a subfield of linguistics and AI.",
    "Data science combines statistics, data analysis, and machine learning.",
    "Machine learning implies that the computer learns from data.",
    "TF-IDF stands for Term Frequency-Inverse Document Frequency.",
    "Text vectorization converts text into numerical vectors for AI models.",
    "Python is excellent for data science and natural language processing."
]

## Step 1: Initialize the TfidfVectorizer

Like `CountVectorizer`, `TfidfVectorizer` handles tokenization and weight calculation automatically. By default it also lowercases the text before tokenizing.

In [None]:
# Initialize the vectorizer
tfidf_vec = TfidfVectorizer()

# Learn the vocabulary and IDF weights, then transform the corpus into a sparse TF-IDF matrix
tfidf_fit = tfidf_vec.fit_transform(corpus)

## Step 2: Visualization

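Let's look at the scores. Compare values within a row: words that appear in many sentences (like "and", "data", "learning") generally get lower weights than words unique to one sentence, such as "linguistics" or "python". A word repeated within a sentence (such as "data" in the second one) can still score high, because its term frequency offsets the lower IDF.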

In [None]:
# Convert the scores into a DataFrame for readability
tfidf_df = pd.DataFrame(tfidf_fit.toarray(), columns=tfidf_vec.get_feature_names_out())

# Display the matrix
tfidf_df

## Key Takeaways

1.  **Contextual Importance**: Unlike raw Bag of Words counts, TF-IDF weights each word by how informative it is: words concentrated in a few documents score higher than words spread across the whole corpus.
2.  **Automatic Noise Reduction**: It naturally down-weights common "noise" words without always needing a hard-coded stopword list.
3.  **Vector Comparison**: Documents with similar TF-IDF vectors are likely about similar subjects.
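
The vector comparison point can be sketched with cosine similarity, a common (though not the only) way to compare TF-IDF vectors. The example uses scikit-learn's `cosine_similarity` on three sentences from the toy corpus; the variable names are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two sentences about machine learning and one about TF-IDF itself.
docs = [
    "Data science combines statistics, data analysis, and machine learning.",
    "Machine learning implies that the computer learns from data.",
    "TF-IDF stands for Term Frequency-Inverse Document Frequency.",
]

matrix = TfidfVectorizer().fit_transform(docs)

# sim[i, j] is the cosine similarity between documents i and j.
sim = cosine_similarity(matrix)

print(sim.round(3))
```

The first two sentences share the words "data", "machine", and "learning", so their similarity is positive. The third sentence shares no tokens with the first, so that pair's similarity is zero.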

## Next steps:
- Use these TF-IDF vectors as input for **Text Classification** models (Logistic Regression, SVM).
- Explore **N-grams** with TF-IDF to capture multi-word phrases (like "machine learning" or "data science").