Great 👌 now let’s move from **Bag of Words (BoW)** and **N-grams** to something smarter: **TF and IDF**.

---

# 📌 1. Term Frequency (TF)

👉 **TF tells how often a word appears in a document.**

$$
TF(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}
$$

* Example:
  Document: `"I love NLP and I love Python"`

  * Word `"love"` appears **2 times**
  * Total words = 7

  $$
  TF(\text{love}) = \frac{2}{7} \approx 0.286
  $$
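
Here's a minimal sketch of that formula in plain Python. It assumes simple whitespace tokenization (no lowercasing, no punctuation handling), and the function name `term_frequency` is just an illustrative choice:

```python
from collections import Counter


def term_frequency(term, document):
    """Share of the document's whitespace-separated tokens that equal `term`."""
    tokens = document.split()
    if not tokens:
        return 0.0  # avoid dividing by zero on an empty document
    return Counter(tokens)[term] / len(tokens)


tf_love = term_frequency("love", "I love NLP and I love Python")
print(round(tf_love, 3))  # 0.286
```

Matching is case-sensitive here, so `"Love"` and `"love"` would count as different terms unless you normalize the text first.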

---

# 📌 2. Inverse Document Frequency (IDF)

👉 **IDF tells how unique or rare a word is across all documents.**

$$
IDF(t) = \log \left( \frac{\text{Total number of documents}}{\text{Number of documents containing term } t} \right)
$$

* Example (using the natural logarithm): suppose we have 10 documents:

  * Word `"love"` appears in 2 documents
  * Word `"NLP"` appears in 8 documents

  $$
  IDF(love) = \log \left(\frac{10}{2}\right) = \log(5) \approx 1.6
  $$

  $$
  IDF(NLP) = \log \left(\frac{10}{8}\right) = \log(1.25) \approx 0.22
  $$

📌 Interpretation:

* Words that are common across the corpus (like `"the"`, `"and"`, and in this example `"NLP"`) → **low IDF** (less important).
* Rare words (like `"love"`) → **high IDF** (more important).
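
To see those numbers come out of code, here's a small sketch using the natural log. The toy corpus below is hypothetical: it's built only so that `"love"` lands in 2 of 10 documents and `"NLP"` in 8 of 10, and `inverse_document_frequency` is an illustrative name:

```python
import math


def inverse_document_frequency(term, documents):
    """ln(N / df), where N = number of documents and df = documents containing `term`."""
    df = sum(1 for doc in documents if term in doc.split())
    if df == 0:
        raise ValueError(f"{term!r} does not appear in any document")
    return math.log(len(documents) / df)


# Hypothetical corpus: 10 documents, "love" in 2 of them, "NLP" in 8 of them.
corpus = ["love NLP"] * 2 + ["NLP"] * 6 + ["Python"] * 2

print(round(inverse_document_frequency("love", corpus), 2))  # 1.61
print(round(inverse_document_frequency("NLP", corpus), 2))   # 0.22
```

The plain formula is undefined for a term that appears in no document, so this sketch raises an error in that case rather than guessing a value.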

---

# 📌 3. TF-IDF (Combined)

👉 **TF-IDF = TF × IDF**

* A term gets a high score only when both factors are high:

  * High TF = appears frequently in this document
  * High IDF = appears in few documents across the corpus

📌 Example with `"love"`:

$$
\text{TF-IDF}(\text{love}) = TF(\text{love}) \times IDF(\text{love}) = \frac{2}{7} \times \log(5) \approx 0.46
$$

So `"love"` carries more weight in that document than `"NLP"`, which appears only once there and has a low IDF (about 1/7 × 0.22 ≈ 0.03).
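
Putting the two pieces together, here's a hedged end-to-end sketch. It assumes whitespace tokenization and the natural log, and it uses a made-up 10-document corpus in which the example sentence is one of the documents, `"love"` appears in 2 documents, and `"NLP"` in 8. The function name `tf_idf` is illustrative:

```python
import math
from collections import Counter


def tf_idf(term, document, documents):
    """TF(term, document) * ln(N / df) with whitespace tokenization."""
    tokens = document.split()
    tf = Counter(tokens)[term] / len(tokens)
    df = sum(1 for doc in documents if term in doc.split())
    return tf * math.log(len(documents) / df)


doc = "I love NLP and I love Python"
# Hypothetical corpus: 10 documents, "love" in 2 of them, "NLP" in 8 of them.
corpus = [doc, "love NLP"] + ["NLP"] * 6 + ["Python"] * 2

print(round(tf_idf("love", doc, corpus), 2))  # 0.46
print(round(tf_idf("NLP", doc, corpus), 3))   # 0.032
```

This sketch expects a non-empty document and a term that occurs somewhere in the corpus; anything else would divide by zero.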

---

# 📌 Why is TF-IDF better than BoW?

* BoW just counts words, so a filler word like `"the"` can look as important as a topical word like `"NLP"`.
* TF-IDF lowers the weight of words that appear in most documents and raises the weight of words that are frequent in one document but rare elsewhere.
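
A quick way to see the difference is to compare raw counts with TF-IDF weights on a tiny, made-up corpus. The four sentences below are invented for illustration, and the sketch again assumes whitespace tokenization and the natural log:

```python
import math
from collections import Counter

# Hypothetical corpus: "the" is in every document, "NLP" only in the first.
corpus = [
    "the NLP model reads the NLP paper",
    "the cat sat on the mat",
    "the dog chased the ball",
    "the weather is nice today",
]
doc = corpus[0]
tokens = doc.split()

# Bag of Words: raw counts, so both words look equally important.
bow = Counter(tokens)
print(bow["the"], bow["NLP"])  # 2 2


def tf_idf(term):
    """TF in `doc` times ln(N / df) over `corpus`."""
    tf = bow[term] / len(tokens)
    df = sum(1 for d in corpus if term in d.split())
    return tf * math.log(len(corpus) / df)


print(round(tf_idf("the"), 3))  # 0.0
print(round(tf_idf("NLP"), 3))  # 0.396
```

Because `"the"` appears in every document, its IDF is ln(1) = 0, so its TF-IDF weight drops to zero even though its raw count matches `"NLP"`.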

---

👉 Do you want me to write a **full code example in Python** where we compute **TF-IDF with NLTK preprocessing + sklearn’s `TfidfVectorizer`**? That way, you can see it in action with real numbers.
