# TF-IDF Vectorizer

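TF-IDF (Term Frequency-Inverse Document Frequency) is a method used in NLP to convert text into numerical vectors. It gives more weight to words that are **distinctive** for a document (frequent in it but rare in the other documents) and reduces the weight of words that are common everywhere.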

**TF (Term Frequency):** Measures how often a word appears in a document.  
  
$$
TF(t) = \frac{\text{Number of times term } t \text{ appears in a document}}{\text{Total number of terms in the document}}
$$

**IDF (Inverse Document Frequency):** Measures how rare a word is across the whole collection of documents; rarer words get a higher score.  
  
$$
IDF(t) = \log \frac{\text{Total number of documents}}{1 + \text{Number of documents containing term } t}
$$

**TF-IDF:** Combines both TF and IDF.  
  
$$
TFIDF(t) = TF(t) \times IDF(t)
$$

### Why TF-IDF is used

* Down-weights common words like “the”, “is”, “and” that appear in almost all documents.
* Highlights words that carry more meaning for a document.
* Useful for **search engines**, **text classification**, **spam detection**, and **recommendation systems**.


### Step-by-Step TF-IDF Calculation 

Let's take three sample sentences:

```python
documents = [
    "I love pizza and pasta",
    "Pizza is my favorite food",
    "I love eating pasta"
]
```

**Step 1: Compute Term Frequency (TF)**

Count the occurrences of each word in a document and divide by the total number of words in that document.

Example for document 1: `"I love pizza and pasta"`

* Total words: 5
* TF("I") = 1/5 = 0.2
* TF("love") = 1/5 = 0.2
* TF("pizza") = 1/5 = 0.2
* TF("and") = 1/5 = 0.2
* TF("pasta") = 1/5 = 0.2
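The same calculation can be sketched in a few lines of Python. This is a minimal illustration, not part of any library: the helper name `term_frequency` is made up here, and it assumes lowercasing and whitespace tokenization.

```python
from collections import Counter


def term_frequency(document):
    """Return {term: count / total number of terms} for one document."""
    # Assumption: lowercase the text and split on whitespace
    tokens = document.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


tf_doc1 = term_frequency("I love pizza and pasta")
print(tf_doc1)
# {'i': 0.2, 'love': 0.2, 'pizza': 0.2, 'and': 0.2, 'pasta': 0.2}
```

Every word appears once among five tokens, so each one gets a TF of 0.2.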

**Step 2: Compute Inverse Document Frequency (IDF)**

* Count how many documents contain each word (ignoring case, so "Pizza" and "pizza" count as the same word):

  * "I" → 2 documents
  * "love" → 2
  * "pizza" → 2
  * "and" → 1
  * "pasta" → 2
  * "is" → 1
  * "my" → 1
  * "favorite" → 1
  * "food" → 1
  * "eating" → 1

* Total documents = 3

$$
IDF(t) = \log\frac{3}{1 + df(t)}
$$

* Example (using base-10 logarithms):

  * IDF("I") = log(3 / (1 + 2)) = log(1) = 0
  * IDF("and") = log(3 / (1 + 1)) = log(1.5) ≈ 0.176
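A short sketch of this step, assuming the same lowercased whitespace tokenization and a base-10 logarithm (which is what the value 0.176 corresponds to). The function name `inverse_document_frequency` is illustrative only.

```python
import math

documents = [
    "I love pizza and pasta",
    "Pizza is my favorite food",
    "I love eating pasta",
]


def inverse_document_frequency(docs):
    """Return {term: log10(N / (1 + df))} over all terms in docs."""
    # One set of lowercased tokens per document, so each term counts once per document
    tokenized = [set(doc.lower().split()) for doc in docs]
    n_docs = len(tokenized)
    vocabulary = sorted(set().union(*tokenized))
    idf = {}
    for term in vocabulary:
        df = sum(term in tokens for tokens in tokenized)  # document frequency
        idf[term] = math.log10(n_docs / (1 + df))
    return idf


idf = inverse_document_frequency(documents)
print(round(idf["i"], 3))    # 0.0
print(round(idf["and"], 3))  # 0.176
```

With only three documents, any word that appears in two of them already gets an IDF of 0 under this formula.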

**Step 3: Multiply TF by IDF**

$$
TFIDF(t) = TF(t) \times IDF(t)
$$

This gives the final TF-IDF vector for each document.
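Putting the three steps together, a minimal end-to-end sketch might look like the following. It assumes lowercased whitespace tokenization and base-10 logarithms, and `tfidf_vectors` is a hypothetical helper, not a library function.

```python
import math
from collections import Counter

documents = [
    "I love pizza and pasta",
    "Pizza is my favorite food",
    "I love eating pasta",
]


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) using TF * log10(N / (1 + df))."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    vocabulary = sorted({term for tokens in tokenized for term in tokens})

    # Step 2: document frequency and IDF for every term
    df = {term: sum(term in tokens for tokens in tokenized) for term in vocabulary}
    idf = {term: math.log10(n_docs / (1 + df[term])) for term in vocabulary}

    # Steps 1 and 3: term frequency per document, multiplied by IDF
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[term] / len(tokens) * idf[term] for term in vocabulary])
    return vocabulary, vectors


vocabulary, vectors = tfidf_vectors(documents)
for term, score in zip(vocabulary, vectors[0]):
    print(f"{term:10s} {score:.4f}")
```

For document 1 only "and" ends up with a non-zero score (0.2 × 0.176 ≈ 0.035), because every other word in it also appears in a second document and therefore has an IDF of 0.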

---



### Implementing TF-IDF using scikit-learn

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents
documents = [
    "I love pizza and pasta",
    "Pizza is my favorite food",
    "I love eating pasta"
]

# Initialize TF-IDF Vectorizer (unigrams and bigrams, English stop words removed)
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words='english')

# Fit and transform the documents
tfidf_matrix = vectorizer.fit_transform(documents)

# Get feature names (words and two-word phrases)
feature_names = vectorizer.get_feature_names_out()

# Convert the sparse TF-IDF matrix to a dense array
tfidf_array = tfidf_matrix.toarray()

# Display TF-IDF values
df = pd.DataFrame(tfidf_array, columns=feature_names)
df
```

| | eating | eating pasta | favorite | favorite food | food | love | love eating | love pizza | pasta | pizza | pizza favorite | pizza pasta |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.393511 | 0.0 | 0.51742 | 0.393511 | 0.393511 | 0.0 | 0.51742 |
| 1 | 0.0 | 0.0 | 0.467351 | 0.467351 | 0.467351 | 0.0 | 0.0 | 0.0 | 0.0 | 0.355432 | 0.467351 | 0.0 |
| 2 | 0.490479 | 0.490479 | 0.0 | 0.0 | 0.0 | 0.373022 | 0.490479 | 0.0 | 0.373022 | 0.0 | 0.0 | 0.0 |


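The result is a table where each row is a document and each column is a term: a single word or a two-word phrase, with English stop words removed. Each value is that term's TF-IDF score in that document. The numbers differ from the manual formula above because `TfidfVectorizer` by default uses raw counts for TF, a smoothed IDF, and L2-normalizes each row.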
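As a rough check, the first row of the table can be recomputed by hand. The sketch below assumes scikit-learn's documented defaults: raw counts as TF, the smoothed IDF $\ln\frac{1 + n}{1 + df} + 1$, and L2 normalization. The document frequencies are read off the three sample sentences after stop-word removal.

```python
import math

n_docs = 3

# Document frequencies of the terms left in document 0 ("I love pizza and pasta")
# once stop words and the one-letter token "I" are dropped
doc_freq = {"love": 2, "pizza": 2, "pasta": 2, "love pizza": 1, "pizza pasta": 1}

# Smoothed IDF; each term occurs once in the document, so TF = 1 and weight = IDF
weights = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

# L2-normalize the row so its Euclidean length is 1
norm = math.sqrt(sum(w * w for w in weights.values()))
row0 = {term: w / norm for term, w in weights.items()}

for term, score in row0.items():
    print(f"{term:12s} {score:.6f}")
```

The unigrams come out at roughly 0.3935 and the bigrams at roughly 0.5174, in line with row 0 of the table.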


### Key Notes 

* TF-IDF reduces the weight of common words across documents.
* Words unique to a document get higher importance.
* It's widely used for text preprocessing in ML/NLP before applying classification or clustering.
* It can be used for **feature extraction** in models like Naive Bayes, Logistic Regression, and SVM.


### Tips

* Use `TfidfVectorizer(stop_words='english')` to remove common English words automatically.
* Combine it with **n-grams** to capture phrases: `TfidfVectorizer(ngram_range=(1,2))`.
* Always fit TF-IDF on training data only, then transform test data to avoid data leakage.
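The last tip can be sketched as follows. The train/test split here is made up for illustration (it simply reuses the sample sentences); the point is that `fit_transform` is called on the training documents only, and `transform` reuses the learned vocabulary and IDF weights on the test documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical split of the sample sentences
train_docs = ["I love pizza and pasta", "Pizza is my favorite food"]
test_docs = ["I love eating pasta"]

vectorizer = TfidfVectorizer()

# Learn the vocabulary and IDF weights from the training data only
X_train = vectorizer.fit_transform(train_docs)

# Reuse them on the test data; words not seen during fit are ignored
X_test = vectorizer.transform(test_docs)

print(X_train.shape, X_test.shape)          # same number of columns
print("eating" in vectorizer.vocabulary_)   # False: "eating" only occurs in the test data
```

Calling `fit_transform` on the test documents instead would rebuild the vocabulary and IDF weights from test data, which is exactly the leakage the tip warns about.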

