## 🔍 What is TF-IDF?

**TF-IDF** stands for **Term Frequency–Inverse Document Frequency**. It is a statistical measure used to evaluate how important a word is to a document in a collection (corpus).

It combines two ideas:
- **TF (Term Frequency):** How often a word appears in a document or sentence.
   
$$
TF = \frac{\text{Number of times the word appears in the sentence}}{\text{Number of words in the sentence}}
$$


- **IDF (Inverse Document Frequency):** How rare that word is across all documents or sentences.

$$
IDF = \log_e\left(\frac{\text{Number of sentences}}{\text{Number of sentences containing the word}}\right)
$$
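The two definitions above can be sketched as small Python functions. This is a minimal illustration, not a library API: the function names `term_frequency` and `inverse_document_frequency` are made up for this example, and each sentence is assumed to be already split into a list of words.

```python
import math

def term_frequency(word, tokens):
    """Share of the words in one sentence that are equal to `word`."""
    return tokens.count(word) / len(tokens)

def inverse_document_frequency(word, corpus):
    """Natural log of (number of sentences / sentences containing `word`).

    Assumes `word` occurs in at least one sentence; with no smoothing,
    an unseen word would cause a division by zero.
    """
    containing = sum(1 for tokens in corpus if word in tokens)
    return math.log(len(corpus) / containing)

# Tiny hypothetical corpus, just to exercise the two functions
toy_corpus = [["a", "b", "a", "c"], ["b", "c"]]
print(term_frequency("a", toy_corpus[0]))
print(inverse_document_frequency("a", toy_corpus))
```

A word that appears in every sentence gets an IDF of `log(1) = 0`, so its TF-IDF score is zero no matter how often it occurs.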

## 🧠 TF-IDF Example (Step-by-Step)

Let's take 3 simple sentences:

- **S1**: "The cat sat on the mat"
- **S2**: "The dog sat on the log"
- **S3**: "The cat chased the dog"

---

### 🔄 Step 1: Preprocessing

After converting to lowercase and removing common stop words (like "the", "on"):

- **S1** → `cat sat mat`
- **S2** → `dog sat log`
- **S3** → `cat chased dog`
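As a rough sketch, this preprocessing step could look like the following in Python. The stop-word set here is an assumption that contains only the two words named above; a real pipeline would normally use a much longer list and a proper tokenizer.

```python
# Assumption: only the stop words mentioned in this example
STOP_WORDS = {"the", "on"}

def preprocess(sentence):
    """Lowercase the sentence, split on whitespace, and drop stop words."""
    return [word for word in sentence.lower().split() if word not in STOP_WORDS]

sentences = [
    "The cat sat on the mat",
    "The dog sat on the log",
    "The cat chased the dog",
]

corpus = [preprocess(s) for s in sentences]
for tokens in corpus:
    print(" ".join(tokens))
```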

---

### 📊 Step 2: Term Frequency (TF)

We calculate TF for each word in each sentence as:

$$
TF = \frac{\text{count of the word in sentence}}{\text{total words in sentence}}
$$

| Word     | S1 (cat sat mat) | S2 (dog sat log) | S3 (cat chased dog) |
|----------|------------------|------------------|---------------------|
| cat      | 1/3              | 0                | 1/3                 |
| dog      | 0                | 1/3              | 1/3                 |
| sat      | 1/3              | 1/3              | 0                   |
| mat      | 1/3              | 0                | 0                   |
| log      | 0                | 1/3              | 0                   |
| chased   | 0                | 0                | 1/3                 |
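The TF table can be reproduced with a few lines of Python. This sketch uses `fractions.Fraction` so the values print as exact fractions like `1/3` instead of floats; the variable names are chosen for illustration only.

```python
from fractions import Fraction

# Preprocessed sentences from Step 1
corpus = {
    "S1": ["cat", "sat", "mat"],
    "S2": ["dog", "sat", "log"],
    "S3": ["cat", "chased", "dog"],
}
vocabulary = ["cat", "dog", "sat", "mat", "log", "chased"]

# tf[word][sentence] = count of the word in the sentence / total words in the sentence
tf = {
    word: {
        name: Fraction(tokens.count(word), len(tokens))
        for name, tokens in corpus.items()
    }
    for word in vocabulary
}

for word, row in tf.items():
    print(f"{word:<7}", "  ".join(f"{name}={value}" for name, value in row.items()))
```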

---

### 📈 Step 3: Inverse Document Frequency (IDF)

We now calculate IDF without smoothing:

$$
IDF = \log_e\left(\frac{\text{Total number of sentences}}{\text{Number of sentences containing the word}}\right)
$$

| Word   | Document Frequency | IDF (natural log, rounded) |
|--------|--------------------|------------------|
| cat    | 2                  | log(3 / 2) ≈ 0.405 |
| dog    | 2                  | log(3 / 2) ≈ 0.405 |
| sat    | 2                  | log(3 / 2) ≈ 0.405 |
| mat    | 1                  | log(3 / 1) ≈ 1.099 |
| log    | 1                  | log(3 / 1) ≈ 1.099 |
| chased | 1                  | log(3 / 1) ≈ 1.099 |
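The document frequencies and IDF values in this table can be checked with a short script. It is a sketch of the unsmoothed formula above, using Python's `math.log` (natural log); the names `document_frequency` and `idf` are illustrative.

```python
import math

# Preprocessed sentences from Step 1
corpus = [
    ["cat", "sat", "mat"],
    ["dog", "sat", "log"],
    ["cat", "chased", "dog"],
]
vocabulary = ["cat", "dog", "sat", "mat", "log", "chased"]
n_sentences = len(corpus)

# Number of sentences that contain each word at least once
document_frequency = {
    word: sum(1 for tokens in corpus if word in tokens) for word in vocabulary
}

# Unsmoothed IDF: ln(total sentences / sentences containing the word)
idf = {word: math.log(n_sentences / document_frequency[word]) for word in vocabulary}

for word in vocabulary:
    print(f"{word:<7} df={document_frequency[word]}  idf={idf[word]:.3f}")
```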

---

### 🧮 Step 4: Calculating TF-IDF

Now that we have both **TF** and **IDF**, we multiply them to get the **TF-IDF score** for each word in each sentence:

$$
\text{TF-IDF} = TF \times IDF
$$

---

### ✅ Final TF-IDF Matrix

| Word     | S1 (cat sat mat) | S2 (dog sat log) | S3 (cat chased dog) |
|----------|------------------|------------------|---------------------|
| cat      | 1/3 × 0.405 ≈ 0.135 | 0 × 0.405 = 0      | 1/3 × 0.405 ≈ 0.135  |
| dog      | 0 × 0.405 = 0     | 1/3 × 0.405 ≈ 0.135 | 1/3 × 0.405 ≈ 0.135  |
| sat      | 1/3 × 0.405 ≈ 0.135 | 1/3 × 0.405 ≈ 0.135 | 0 × 0.405 = 0        |
| mat      | 1/3 × 1.099 ≈ 0.366 | 0                   | 0                   |
| log      | 0                  | 1/3 × 1.099 ≈ 0.366  | 0                   |
| chased   | 0                  | 0                   | 1/3 × 1.099 ≈ 0.366  |
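Putting the steps together, the whole matrix can be computed in plain Python. This is a minimal sketch of the unsmoothed TF × IDF used in this example; library implementations often add smoothing or normalization, so their numbers may differ from the table.

```python
import math

# Preprocessed sentences from Step 1
corpus = {
    "S1": ["cat", "sat", "mat"],
    "S2": ["dog", "sat", "log"],
    "S3": ["cat", "chased", "dog"],
}
vocabulary = ["cat", "dog", "sat", "mat", "log", "chased"]
n_sentences = len(corpus)

def tf(word, tokens):
    """Count of the word in the sentence divided by the sentence length."""
    return tokens.count(word) / len(tokens)

def idf(word):
    """Unsmoothed natural-log IDF over the three sentences."""
    containing = sum(1 for tokens in corpus.values() if word in tokens)
    return math.log(n_sentences / containing)

# tfidf[word][sentence] = TF * IDF
tfidf = {
    word: {name: tf(word, tokens) * idf(word) for name, tokens in corpus.items()}
    for word in vocabulary
}

for word in vocabulary:
    print(f"{word:<7}", "  ".join(f"{tfidf[word][name]:.3f}" for name in corpus))
```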


---

### 🔎 Interpretation

- Words like **"mat"**, **"log"**, and **"chased"** have higher TF-IDF scores because they appear in only **one document** → making them more **informative**.
- Words like **"cat"**, **"dog"**, **"sat"** appear in multiple documents → hence have **lower scores**.
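- TF-IDF helps emphasize words that are **distinctive to a document** and downplay words that are **spread across many documents**. A word that appeared in all three sentences would get an IDF of log(3 / 3) = 0, and therefore a TF-IDF score of 0.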




### 💡 Why use TF-IDF?
A plain Bag of Words (BoW) representation uses raw word counts, so a word's weight does not depend on how common it is across the corpus. **TF-IDF gives more weight to rare, distinctive words** and reduces the impact of common words that appear in many documents.

It is widely used in:
- Document classification
- Information retrieval (e.g., search engines)
- Text similarity and clustering
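To make the text-similarity use case concrete, here is a small sketch that compares the three example sentences using cosine similarity between their TF-IDF vectors. Cosine similarity is one common choice, picked here as an assumption; the helper names `tfidf_vector` and `cosine` are made up for this example.

```python
import math

# Preprocessed sentences from the example above
corpus = [
    ["cat", "sat", "mat"],
    ["dog", "sat", "log"],
    ["cat", "chased", "dog"],
]
vocabulary = sorted({word for tokens in corpus for word in tokens})

def tfidf_vector(tokens):
    """Unsmoothed TF-IDF vector for one sentence, ordered by `vocabulary`."""
    vector = []
    for word in vocabulary:
        tf = tokens.count(word) / len(tokens)
        df = sum(1 for other in corpus if word in other)
        vector.append(tf * math.log(len(corpus) / df))
    return vector

def cosine(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms

vectors = [tfidf_vector(tokens) for tokens in corpus]
for i in range(len(corpus)):
    for j in range(i + 1, len(corpus)):
        print(f"S{i + 1} vs S{j + 1}: {cosine(vectors[i], vectors[j]):.3f}")
```

In this tiny corpus every pair of sentences shares exactly one word with the same weight, so all three similarities come out equal; on real text the scores would spread out.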