## 🧺 Bag of Words (BoW)

**Bag of Words (BoW)** is a natural extension of **one-hot encoding**: instead of representing only the **presence or absence** of words, BoW also captures **how frequently** each word appears in a document.

Like one-hot encoding, it converts text into numerical vectors. But while one-hot represents each word individually, BoW treats the entire sentence or document as a single unit and builds a **frequency-based feature vector** using the words in a fixed vocabulary.

---

### 💡 What Problems Can BoW Help Solve?

BoW is commonly used for:
- **Text classification** (e.g., spam detection, sentiment analysis)
- **Document similarity** and search ranking
- **Feature input** for traditional ML models such as Naive Bayes, SVMs, or Logistic Regression

---

### 🧾 Step-by-Step Process:

Let’s say we have 3 sentences:
1. `"I love NLP"`  
2. `"NLP is fun"`  
3. `"I love fun"`

---

### ✅ Step 1: Lowercase All Words
Convert all text to lowercase for consistency:
- `"i love nlp"`
- `"nlp is fun"`
- `"i love fun"`

---

### 🚫 Step 2: Remove Stop Words
Stop words (like *i*, *is*, *the*, *and*) carry little meaning on their own, so we remove them:
- `"i love nlp"` → `"love nlp"`
- `"nlp is fun"` → `"nlp fun"`
- `"i love fun"` → `"love fun"`
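Steps 1 and 2 can be sketched in a few lines of plain Python. This is a minimal illustration, not a production tokenizer: the `STOP_WORDS` set and the `preprocess` helper are hypothetical names chosen for this example, and the stop-word list is hand-picked to match the sentences above.

```python
# Hypothetical, hand-picked stop-word list (an assumption for this example).
STOP_WORDS = {"i", "is", "the", "and"}

sentences = ["I love NLP", "NLP is fun", "I love fun"]

def preprocess(sentence):
    """Lowercase the text, split on whitespace, and drop stop words."""
    tokens = sentence.lower().split()
    return [tok for tok in tokens if tok not in STOP_WORDS]

cleaned = [preprocess(s) for s in sentences]
print(cleaned)
# [['love', 'nlp'], ['nlp', 'fun'], ['love', 'fun']]
```

Real text usually also needs punctuation handling and a fuller stop-word list, which this sketch leaves out.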

---

### 🔤 Step 3: Extract Unique Words
From all sentences, extract unique non-stop words (the vocabulary):

`["love", "nlp", "fun"]`
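One possible way to collect the vocabulary while keeping words in the order they first appear is shown below. The `cleaned` list is assumed to hold the already lowercased, stop-word-free tokens of each sentence.

```python
# Tokens per sentence after lowercasing and stop-word removal (assumed input).
cleaned = [["love", "nlp"], ["nlp", "fun"], ["love", "fun"]]

# dict.fromkeys drops duplicates and preserves first-seen order (Python 3.7+).
vocabulary = list(dict.fromkeys(tok for tokens in cleaned for tok in tokens))
print(vocabulary)
# ['love', 'nlp', 'fun']
```

A plain `set` would also remove duplicates, but its iteration order is not guaranteed, which makes the column order of the final matrix harder to reproduce.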

---

### 🔢 Step 4: Vocabulary with Frequencies (Descending Order)
Count how often each word appears across all documents:
- `"love"` → 2
- `"nlp"` → 2
- `"fun"` → 2

Sort by frequency in descending order. All three words tie here, so we keep the order of first appearance:

`["love", "nlp", "fun"]`
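The counting and sorting can be sketched with `collections.Counter` from the standard library. The `cleaned` input is assumed to be the preprocessed tokens of the three example sentences; over those sentences every word occurs exactly twice.

```python
from collections import Counter

# Tokens per sentence after lowercasing and stop-word removal (assumed input).
cleaned = [["love", "nlp"], ["nlp", "fun"], ["love", "fun"]]

# Count every token across all documents.
counts = Counter(tok for tokens in cleaned for tok in tokens)

# most_common() sorts by count, descending; ties keep first-seen order.
ranked = counts.most_common()
print(ranked)
# [('love', 2), ('nlp', 2), ('fun', 2)]

vocabulary = [word for word, _ in ranked]
print(vocabulary)
# ['love', 'nlp', 'fun']
```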

---

### ✂️ Step 5: Use Top Features Only (Optional)
In real-world datasets with large vocabularies, we often select the **top N most frequent words** (e.g., top 10) and ignore the rest. This reduces dimensionality and noise.
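As a rough sketch of top-N selection, the snippet below keeps only the 2 most frequent words. The token list is hypothetical and was made up so that the counts differ; `TOP_N` is likewise an illustrative parameter.

```python
from collections import Counter

# Hypothetical token stream with uneven frequencies (made up for this example).
all_tokens = ["fun", "nlp", "fun", "love", "nlp", "fun", "rare"]
TOP_N = 2  # illustrative cutoff

counts = Counter(all_tokens)
top_words = [word for word, _ in counts.most_common(TOP_N)]
print(top_words)
# ['fun', 'nlp']

# Tokens outside the reduced vocabulary are simply ignored later on.
document = ["love", "fun", "rare", "fun"]
kept = [tok for tok in document if tok in top_words]
print(kept)
# ['fun', 'fun']
```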

---

### 🧮 Step 6: Create Document-Term Matrix

Now we build a matrix with:
- Rows = sentences/documents
- Columns = vocabulary words (features)
- Values = word frequency per sentence

| Sentence       | love | nlp | fun |
|----------------|------|-----|-----|
| "love nlp"     | 1    | 1   | 0   |
| "nlp fun"      | 0    | 1   | 1   |
| "love fun"     | 1    | 0   | 1   |
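The table can be reproduced with a short sketch like the following. The `vectorize` helper is a hypothetical name for this example; each row is built by counting a document's tokens and reading the counts off in vocabulary order.

```python
from collections import Counter

vocabulary = ["love", "nlp", "fun"]
documents = [["love", "nlp"], ["nlp", "fun"], ["love", "fun"]]

def vectorize(tokens, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(tokens)
    # Counter returns 0 for words that do not occur in the document.
    return [counts[word] for word in vocabulary]

matrix = [vectorize(doc, vocabulary) for doc in documents]
for row in matrix:
    print(row)
# [1, 1, 0]
# [0, 1, 1]
# [1, 0, 1]
```

Note that the vectors record only counts, so the original word order within each sentence is lost.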

---

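BoW vectors come in two common variants. No word repeats within a sentence in this example, so both variants give the same matrix here:

- **Frequency BoW**: Count how many times each word appears in the sentence.
- **Binary BoW**: Use `1` if the word appears in the sentence and `0` if it doesn't, ignoring how many times it appears.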
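The difference only shows up when a word repeats within one document. The sketch below uses a hypothetical sentence, `"love love nlp"`, that is not part of the original example, to contrast the two variants.

```python
from collections import Counter

vocabulary = ["love", "nlp", "fun"]
tokens = ["love", "love", "nlp"]  # hypothetical document with a repeated word

counts = Counter(tokens)

# Frequency BoW: raw count of each vocabulary word.
frequency_vector = [counts[word] for word in vocabulary]

# Binary BoW: 1 if the word occurs at least once, else 0.
binary_vector = [1 if counts[word] > 0 else 0 for word in vocabulary]

print(frequency_vector)  # [2, 1, 0]
print(binary_vector)     # [1, 1, 0]
```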

---

This simple but powerful technique allows us to feed raw text into classic machine learning models by converting it into a structured, numerical format.