# Bag of Words (BoW) and TF-IDF

**Objective:** Convert text data into numerical vectors that machine learning models can understand.

---
## Why Do We Need Feature Extraction?
Machine learning models can't process raw text. We need to represent words as **numbers**. Two popular methods are:

- **Bag of Words (BoW):** Counts the frequency of words.
- **TF-IDF (Term Frequency–Inverse Document Frequency):** Weights each word by how informative it is across documents.

---
## Bag of Words (BoW)

BoW creates a vocabulary of all words in the dataset and counts occurrences in each document.

**Example:**

| Document | Text |
|-----------|------|
| 1 | I love machine learning |
| 2 | I love coding in Python |

→ Vocabulary: `[I, love, machine, learning, coding, in, Python]`

| Word | Doc1 | Doc2 |
|------|------|------|
| I | 1 | 1 |
| love | 1 | 1 |
| machine | 1 | 0 |
| learning | 1 | 0 |
| coding | 0 | 1 |
| in | 0 | 1 |
| Python | 0 | 1 |
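The table above can be reproduced by hand before reaching for a library. The following is a minimal sketch, assuming whitespace tokenization with case preserved (which matches the hand-built table, not necessarily any library default); the variable names `tokenized`, `vocabulary`, and `bow_matrix` are illustrative.

```python
from collections import Counter

docs = [
    "I love machine learning",
    "I love coding in Python",
]

# Assumption: tokens are whitespace-separated and case is preserved,
# which matches the hand-built table rather than any library default.
tokenized = [doc.split() for doc in docs]

# Build the vocabulary in order of first appearance.
vocabulary = []
for tokens in tokenized:
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)

# One count vector per document, aligned with the vocabulary order.
bow_matrix = []
for tokens in tokenized:
    counts = Counter(tokens)  # missing words count as 0
    bow_matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in bow_matrix:
    print(row)
```

Each row of `bow_matrix` corresponds to one document column of the table, read top to bottom.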

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "I love machine learning",
    "I love coding in Python"
]

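# Build the vocabulary and count term occurrences in one step.
# Defaults: text is lowercased and single-character tokens (e.g. "I") are dropped.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # sparse document-term matrix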

print("Vocabulary:", vectorizer.get_feature_names_out())
print("\nBoW Matrix:\n", X.toarray())

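**Interpretation:** Each document becomes a numerical vector of word counts. Note that `CountVectorizer` lowercases the text and, by default, drops single-character tokens such as *I*, so its vocabulary has six entries rather than the seven shown in the hand-built table.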
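To see why the library vocabulary can differ from a hand-built one, the default tokenization can be approximated with the standard library. This is a rough sketch: the pattern below is scikit-learn's documented default `token_pattern`, but `sketch_tokenize` is a hypothetical helper, not part of the library.

```python
import re

# scikit-learn's default token pattern: runs of two or more word characters.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def sketch_tokenize(text):
    """Hypothetical helper approximating CountVectorizer's default analyzer."""
    return re.findall(DEFAULT_TOKEN_PATTERN, text.lower())

tokens_doc1 = sketch_tokenize("I love machine learning")
tokens_doc2 = sketch_tokenize("I love coding in Python")

# The vectorizer reports its vocabulary in sorted order.
vocabulary = sorted(set(tokens_doc1) | set(tokens_doc2))

print(tokens_doc1)
print(tokens_doc2)
print(vocabulary)
```

If single-character words matter for a task, `CountVectorizer` accepts a custom `token_pattern` argument, for example `r"(?u)\b\w+\b"`.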

### Pros
- Simple and intuitive.
- Works well for small datasets.

### Cons
- Ignores meaning and word order.
- A large vocabulary produces high-dimensional, mostly sparse vectors.
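The loss of word order is easy to demonstrate. In this small sketch, two sentences with opposite meanings end up with identical representations; `bow_counts` is a hypothetical helper that assumes lowercase, whitespace tokenization.

```python
from collections import Counter

def bow_counts(text):
    """Hypothetical helper: lowercase, split on whitespace, count tokens."""
    return Counter(text.lower().split())

a = bow_counts("dog bites man")
b = bow_counts("man bites dog")

# Same words, same counts: BoW cannot tell the two sentences apart.
same_representation = (a == b)
print(same_representation)
```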

---
## TF-IDF (Term Frequency–Inverse Document Frequency)

TF-IDF improves BoW by reducing the weight of common words and increasing the weight of rare, informative words.

**Formula:**

\[ \text{TF-IDF} = \text{TF} \times \text{IDF} \]

Where:
- **TF (Term Frequency):** How often a term appears in a document.
- **IDF (Inverse Document Frequency):** How rare a term is across all documents.

\[ \text{IDF} = \log\left(\frac{N}{df}\right) \]
where *N* = total documents, *df* = number of documents containing the term.
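The formula can be worked through by hand on the two example documents. This is a minimal sketch under stated assumptions: raw counts for TF, the natural logarithm for IDF (the formula does not fix a base), and lowercase whitespace tokenization. The helper names `tf`, `idf`, and `tf_idf` are illustrative.

```python
import math

docs = [
    "I love machine learning",
    "I love coding in Python",
]

# Assumption: lowercase, whitespace-separated tokens.
tokenized_docs = [doc.lower().split() for doc in docs]
N = len(tokenized_docs)  # total number of documents

def tf(term, tokens):
    # Term frequency as a raw count within one document.
    return tokens.count(term)

def idf(term):
    # df = number of documents containing the term.
    # Only valid for terms that occur in the corpus (df > 0).
    df = sum(1 for tokens in tokenized_docs if term in tokens)
    return math.log(N / df)  # natural log, by assumption

def tf_idf(term, tokens):
    return tf(term, tokens) * idf(term)

score_love = tf_idf("love", tokenized_docs[0])
score_machine = tf_idf("machine", tokenized_docs[0])
print("love:", score_love)
print("machine:", score_machine)
```

Because *love* appears in both documents, its IDF is log(2/2) = 0 and the word contributes nothing, while *machine* appears in only one document and gets log(2/1) = log 2.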

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

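# Reuses `docs` from the BoW cell above.
# Defaults: smoothed IDF and L2-normalized rows, so values differ from the textbook formula.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(docs)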

print("Vocabulary:", tfidf.get_feature_names_out())
print("\nTF-IDF Matrix:\n", X_tfidf.toarray())

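✅ **TF-IDF gives higher weight to terms that appear in fewer documents** (like *machine* or *python*), while terms shared by both documents (*love*) get lower scores. *I* does not appear in the output at all, because the default tokenizer drops single-character tokens.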
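The values printed by `TfidfVectorizer` will not match the plain log(N/df) formula. With default settings, scikit-learn documents a smoothed IDF, ln((1 + N) / (1 + df)) + 1, followed by L2 normalization of each document vector. The sketch below reproduces that calculation in pure Python for the first document, assuming the tokens left after default tokenization; `smoothed_idf` and `tfidf_row` are hypothetical helper names.

```python
import math

# Assumption: tokens after lowercasing and dropping single-character words.
tokenized_docs = [
    ["love", "machine", "learning"],
    ["love", "coding", "in", "python"],
]
N = len(tokenized_docs)

def smoothed_idf(term):
    # Smoothed variant: ln((1 + N) / (1 + df)) + 1
    df = sum(1 for tokens in tokenized_docs if term in tokens)
    return math.log((1 + N) / (1 + df)) + 1

def tfidf_row(tokens):
    # Raw TF times smoothed IDF, then scale the vector to unit L2 length.
    raw = {term: tokens.count(term) * smoothed_idf(term) for term in set(tokens)}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {term: value / norm for term, value in raw.items()}

doc1_weights = tfidf_row(tokenized_docs[0])
for term, weight in sorted(doc1_weights.items()):
    print(f"{term}: {weight:.4f}")
```

Under this variant the shared word *love* still receives a nonzero weight, only a smaller one than *machine* or *learning*.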

---
## Visualizing Word Importance

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

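# One row per document, one column per vocabulary term.
df_tfidf = pd.DataFrame(X_tfidf.toarray(), index=['Doc1', 'Doc2'],
                        columns=tfidf.get_feature_names_out())
# Transpose so each word gets one bar per document.
df_tfidf.T.plot(kind='bar', figsize=(8,4))
plt.title('TF-IDF Scores for Each Word')
plt.ylabel('TF-IDF score')
plt.xlabel('Words')
plt.show()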

---
## Comparison Table

| Feature | Bag of Words | TF-IDF |
|----------|---------------|--------|
| Representation | Raw word counts | Counts weighted by rarity across documents |
| Down-weights common words | ❌ No | ✅ Yes |
| Computational cost | Low | Medium |
| Works better for | Small datasets | Larger text corpora |

---
## Summary
- **BoW** represents text with raw word counts.
- **TF-IDF** adjusts weights based on word importance.
- Both are common feature-extraction steps before applying ML algorithms like Naive Bayes or SVM.

---
**Next:** `05-Word_Embeddings_Basics.ipynb`, which covers how to represent words as continuous vectors using embeddings.