## **Bag of Words (BoW)**

---

### **What is Bag of Words (BoW)?**

**Bag of Words** is a technique that represents text as **a vector of word counts**.

* It **ignores grammar, word order, and meaning**.
* It only **counts how many times each word appears** in the text.

---

### **Why is BoW Needed?**

* Machine learning models can’t work with raw text — they need numbers.
* BoW provides a simple way to convert text into **fixed-length numeric vectors**.
* Useful for basic models and when meaning/context isn’t important.

---

### **How Does It Work?**

1. Build a **vocabulary** (list of unique words) from the text.
2. For each text/document, **count** how many times each word from the vocabulary appears.

---

### **Example:**

```text
Text 1: "I love cats"  
Text 2: "I love dogs"
```

**Vocabulary**: `["I", "love", "cats", "dogs"]`

| Text          | I | love | cats | dogs |
| ------------- | - | ---- | ---- | ---- |
| "I love cats" | 1 | 1    | 1    | 0    |
| "I love dogs" | 1 | 1    | 0    | 1    |

Each row is now a **vector of word counts** — ready for the model!
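The two steps and the table above can be reproduced with a few lines of plain Python. This is only a minimal sketch: it splits on whitespace and keeps the original casing, and the helper names `build_vocabulary` and `to_count_vector` are made up for illustration rather than taken from any library.

```python
from collections import Counter


def build_vocabulary(texts):
    """Step 1: collect the unique words in order of first appearance."""
    vocabulary = []
    seen = set()
    for text in texts:
        for word in text.split():
            if word not in seen:
                seen.add(word)
                vocabulary.append(word)
    return vocabulary


def to_count_vector(text, vocabulary):
    """Step 2: count how often each vocabulary word occurs in the text."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]


texts = ["I love cats", "I love dogs"]
vocabulary = build_vocabulary(texts)
vectors = [to_count_vector(text, vocabulary) for text in texts]

print(vocabulary)  # ['I', 'love', 'cats', 'dogs']
for text, vector in zip(texts, vectors):
    print(text, vector)
# I love cats [1, 1, 1, 0]
# I love dogs [1, 1, 0, 1]
```

Every vector has the same length as the vocabulary, which is what makes the representation fixed-length regardless of how long each text is.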

---

### **When to Use BoW?**

* When text **structure or context is not important**
* For **simple classification** or baseline models
* When you want **fast and easy feature extraction**

---

### **Limitations**

* **Ignores word meaning** (e.g., "happy" and "joyful" are treated as completely unrelated words)
* **Ignores word order**
* For large vocabularies, vectors become **long and sparse**
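
These limitations are easy to see on a toy corpus. The sketch below is a plain-Python illustration (the helper `bow_vector` and the example sentences are assumptions made for this demo, not part of any library): two sentences with opposite meanings get identical vectors, two synonyms share nothing, and most entries are already zero.

```python
from collections import Counter


def bow_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the lowercased text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


docs = ["dog bites man", "man bites dog", "happy", "joyful"]
vocabulary = sorted({word for doc in docs for word in doc.lower().split()})
vectors = [bow_vector(doc, vocabulary) for doc in docs]

# Word order is lost: both sentences map to exactly the same vector.
same_vector = vectors[0] == vectors[1]
print(same_vector)  # True

# Meaning is lost: the synonyms occupy different columns, so their dot product is 0.
synonym_overlap = sum(a * b for a, b in zip(vectors[2], vectors[3]))
print(synonym_overlap)  # 0

# Sparsity: fraction of zero entries across all vectors.
total_entries = len(vectors) * len(vocabulary)
zero_fraction = sum(value == 0 for vec in vectors for value in vec) / total_entries
print(zero_fraction)  # 0.6
```

With only five vocabulary words, 60% of the entries are already zero; the fraction grows quickly as the vocabulary gets larger.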

---

### **Summary**

| Aspect    | Details                                   |
| --------- | ----------------------------------------- |
| Purpose   | Convert text to word-count vectors        |
| Pros      | Simple, fast, easy to use                 |
| Cons      | No word meaning, large and sparse vectors |
| Use Cases | Text classification, spam detection, etc. |

---


### **How It Works in Code**

```python
import pandas as pd

# Toy corpus: four short documents, each with a binary label
df = pd.DataFrame({'text': ['people watch campusx', 'campusx watch campusx', 'people write comment', 'campusx write comment'], 'output': [1, 1, 0, 0]})
df
```

|   | text                  | output |
| - | --------------------- | ------ |
| 0 | people watch campusx  | 1      |
| 1 | campusx watch campusx | 1      |
| 2 | people write comment  | 0      |
| 3 | campusx write comment | 0      |


```python
from sklearn.feature_extraction.text import CountVectorizer

# Initialize CountVectorizer with its default settings
cv = CountVectorizer()

# Learn the vocabulary from the text and build the document-term count matrix
BOW = cv.fit_transform(df['text'])
```

```python
# Mapping from each word to its column index in the count matrix
print(cv.vocabulary_)
```

```text
{'people': 2, 'watch': 3, 'campusx': 0, 'write': 4, 'comment': 1}
```


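```python
# Count vector for the first document, "people watch campusx"
print(BOW[0].toarray())
```

```text
[[1 0 1 1 0]]
```

```python
# Count vector for the third document, "people write comment"
print(BOW[2].toarray())
```

```text
[[0 1 1 0 1]]
```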


```python
# What if we get a completely new sentence?
# Words that are not in the fitted vocabulary ("and", "of") are simply ignored.
cv.transform(["campusx watch and write comment of campusx"]).toarray()
```

```text
array([[2, 1, 0, 1, 1]])
```
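
To see why the new sentence produces that vector, the lookup can be sketched in plain Python. The function `transform` below is a hypothetical stand-in written for this illustration, not scikit-learn's implementation; it reuses the vocabulary printed earlier and a regular expression equivalent to the default token pattern (words of two or more characters).

```python
import re

# Vocabulary learned during fitting, copied from the printed cv.vocabulary_
vocabulary = {'people': 2, 'watch': 3, 'campusx': 0, 'write': 4, 'comment': 1}


def transform(text, vocabulary):
    """Count only words that already have a column; skip everything else."""
    vector = [0] * len(vocabulary)
    for token in re.findall(r"\b\w\w+\b", text.lower()):
        column = vocabulary.get(token)
        if column is not None:  # unseen words such as "and" / "of" are dropped
            vector[column] += 1
    return vector


new_vector = transform("campusx watch and write comment of campusx", vocabulary)
print(new_vector)  # [2, 1, 0, 1, 1]
```

The vocabulary is frozen after fitting, so a new sentence can never add columns. A sentence made up entirely of unseen words would map to an all-zero vector.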

### **Let's Try With a Dataset**

```python
# Load the spam dataset (expects spam.csv in the working directory)
spam_classifier = pd.read_csv("spam.csv")
spam_classifier.head()
```

|   | Category | Message                                             |
| - | -------- | --------------------------------------------------- |
| 0 | ham      | Go until jurong point, crazy.. Available only ...   |
| 1 | ham      | Ok lar... Joking wif u oni...                       |
| 2 | spam     | Free entry in 2 a wkly comp to win FA Cup fina...   |
| 3 | ham      | U dun say so early hor... U c already then say...   |
| 4 | ham      | Nah I don't think he goes to usf, he lives aro...   |


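```python
from sklearn.feature_extraction.text import CountVectorizer

# Initialize CountVectorizer with custom preprocessing
cv = CountVectorizer(
    lowercase=True,             # convert text to lowercase before tokenizing
    stop_words='english',       # drop common English words (built-in list)
    token_pattern=r'[a-zA-Z]+'  # a token is any run of letters; digits and punctuation are dropped
)
cv
```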
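
The effect of these three settings can be approximated in plain Python. This is a rough sketch under stated assumptions: `STOP_WORDS` is a tiny made-up list for illustration (scikit-learn's built-in English list is much longer), and `analyze` is a hypothetical helper, not a library function.

```python
import re

# Hypothetical, tiny stop-word list for illustration only
STOP_WORDS = {"a", "in", "to", "the", "of", "and"}


def analyze(text):
    """Lowercase, extract runs of letters, then drop stop words."""
    text = text.lower()                       # lowercase=True
    tokens = re.findall(r"[a-zA-Z]+", text)   # token_pattern='[a-zA-Z]+'
    return [t for t in tokens if t not in STOP_WORDS]  # stop_words filter


tokens = analyze("Free entry in 2 a wkly comp to win FA Cup")
print(tokens)  # ['free', 'entry', 'wkly', 'comp', 'win', 'fa', 'cup']

# A letters-only pattern also splits contractions at the apostrophe.
contraction_tokens = analyze("I don't think")
print(contraction_tokens)  # ['i', 'don', 't', 'think']
```

Note that the letters-only pattern removes the digit `2` and breaks `don't` into `don` and `t`, which may or may not be desirable depending on the task.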

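```python
# Learn the vocabulary from all messages and build the document-term count matrix
BOW = cv.fit_transform(spam_classifier['Message'])

# Print the full word-to-column-index mapping
print(cv.vocabulary_)
```
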

```python
# Count vector for the second message; NumPy abbreviates the long row with "..."
print(BOW[1].toarray())
```

```text
[[0 0 0 ... 0 0 0]]
```
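
The row is almost entirely zeros because a single message uses only a handful of the words in the vocabulary. That is why `fit_transform` returns a sparse matrix that stores only the non-zero entries. The idea can be sketched in plain Python; the vocabulary, the sample text, and the helpers `sparse_row` and `to_dense` below are all hypothetical and only illustrate the principle, not the actual storage format.

```python
from collections import Counter

# Hypothetical ten-word vocabulary: word -> column index
words = ["call", "free", "joking", "lar", "ok", "oni", "prize", "text", "win", "wkly"]
vocabulary = {word: index for index, word in enumerate(words)}


def sparse_row(text, vocabulary):
    """Store only the non-zero counts as {column_index: count}."""
    counts = Counter(text.lower().split())
    return {vocabulary[w]: c for w, c in counts.items() if w in vocabulary}


def to_dense(row, n_columns):
    """Expand a sparse row into a full list, filling the gaps with zeros."""
    dense = [0] * n_columns
    for column, count in row.items():
        dense[column] = count
    return dense


row = sparse_row("ok lar joking oni", vocabulary)
dense = to_dense(row, len(vocabulary))

print(row)    # {4: 1, 3: 1, 2: 1, 5: 1}
print(dense)  # [0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
```

The sparse form needs four entries no matter how large the vocabulary grows, while the dense form grows with every new word.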
