# Bag of Words (BoW) in NLP 

The **Bag of Words (BoW)** model is one of the simplest text representations in NLP and a strong baseline for many tasks.
This notebook walks through the concept, an implementation with scikit-learn, customization options, advantages, and limitations.

## **1. What is Bag of Words?**

The **Bag of Words** model represents each document by the words it contains and how often each one occurs, ignoring grammar and word order.

### **How It Works:**
1. Tokenize the text into words.
2. Create a vocabulary of all unique words.
3. Represent each document as a vector where each element corresponds to the frequency of a word in the document.

### **Example:**

For two sentences:
- Sentence 1: "The cat sat on the mat"
- Sentence 2: "The dog lay on the rug"

Vocabulary (after lowercasing, so "The" and "the" count as the same word): `['the', 'cat', 'sat', 'on', 'mat', 'dog', 'lay', 'rug']`

BoW representation:
- Sentence 1: `[2, 1, 1, 1, 1, 0, 0, 0]`
- Sentence 2: `[2, 0, 0, 1, 0, 1, 1, 1]`
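
The three steps can be sketched in plain Python. This is a minimal illustration rather than how any particular library does it: it assumes lowercasing (so "The" and "the" are one word), whitespace tokenization, and a vocabulary ordered by first appearance. The helper names `tokenize`, `build_vocabulary`, and `vectorize` are made up for this example.

```python
from collections import Counter

def tokenize(text):
    # Assumption: lowercase and split on whitespace (no punctuation handling).
    return text.lower().split()

def build_vocabulary(documents):
    # Collect unique tokens in order of first appearance.
    vocabulary = []
    seen = set()
    for doc in documents:
        for token in tokenize(doc):
            if token not in seen:
                seen.add(token)
                vocabulary.append(token)
    return vocabulary

def vectorize(text, vocabulary):
    # Count each vocabulary word in the document (0 if absent).
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

sentences = ["The cat sat on the mat", "The dog lay on the rug"]
vocabulary = build_vocabulary(sentences)
vectors = [vectorize(s, vocabulary) for s in sentences]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each vector has one position per vocabulary word, so both sentences end up with the same length regardless of how many words they contain.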


## **2. Why Use Bag of Words?**

BoW is simple yet powerful for tasks like:
- Text classification
- Sentiment analysis
- Information retrieval


## **3. Implementing Bag of Words in Python**

### **Step 1: Import Libraries and Create a Dataset**

In [1]:
# Import libraries
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Sample dataset
data = {
    "Sentences": [
        "The cat sat on the mat",
        "The dog lay on the rug",
        "The cat chased the dog"
    ]
}

df = pd.DataFrame(data)
print("Dataset:")
print(df)

Dataset:
                Sentences
0  The cat sat on the mat
1  The dog lay on the rug
2  The cat chased the dog


### **Step 2: Tokenize and Vectorize Using `CountVectorizer`**

In [2]:
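# By default, CountVectorizer lowercases text and keeps tokens of two or more characters
vectorizer = CountVectorizer()

# Learn the vocabulary and build the (sparse) document-term count matrix
X = vectorizer.fit_transform(df['Sentences'])

# Vocabulary terms, in column order
print("Vocabulary:", vectorizer.get_feature_names_out())

# Convert to a dense array for display
print("BoW Representation:")
print(X.toarray())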

Vocabulary: ['cat' 'chased' 'dog' 'lay' 'mat' 'on' 'rug' 'sat' 'the']
BoW Representation:
[[1 0 0 0 1 1 0 1 2]
 [0 0 1 1 0 1 1 0 2]
 [1 1 1 0 0 0 0 0 2]]
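
Once fitted, the same vectorizer can encode text it has not seen before. The sketch below rebuilds the three sentences so it runs on its own, checks that the result is a SciPy sparse matrix, and transforms a made-up sentence containing a word ("bird") that is not in the fitted vocabulary. The sentence is a hypothetical example, not part of the dataset.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "The cat sat on the mat",
    "The dog lay on the rug",
    "The cat chased the dog",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)

# fit_transform returns a sparse matrix: only non-zero counts are stored.
print("Sparse:", issparse(X), "| shape:", X.shape, "| non-zero entries:", X.nnz)

# Hypothetical unseen sentence: "bird" is not in the vocabulary, so it is ignored.
new_counts = vectorizer.transform(["The bird sat on the mat"]).toarray()
print(new_counts)
```

Note that `transform` reuses the vocabulary learned by `fit_transform`; calling `fit_transform` again on new text would build a different vocabulary and produce columns that no longer line up.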


### **Step 3: Convert BoW into a DataFrame for Better Visualization**

In [3]:
# Convert BoW to a DataFrame
bow_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
print("Bag of Words DataFrame:")
print(bow_df)

Bag of Words DataFrame:
   cat  chased  dog  lay  mat  on  rug  sat  the
0    1       0    0    0    1   1    0    1    2
1    0       0    1    1    0   1    1    0    2
2    1       1    1    0    0   0    0    0    2


## **4. Customizing Bag of Words**

You can customize `CountVectorizer` to suit specific needs.

### **(a) Removing Stop Words**

In [4]:
# Remove stop words
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(df['Sentences'])
print("Vocabulary without Stop Words:", vectorizer.get_feature_names_out())

Vocabulary without Stop Words: ['cat' 'chased' 'dog' 'lay' 'mat' 'rug' 'sat']


### **(b) Limiting Vocabulary Size**

In [5]:
# Keep only the 5 terms with the highest total count across the corpus
vectorizer = CountVectorizer(max_features=5)
X = vectorizer.fit_transform(df['Sentences'])
print("Top 5 Vocabulary:", vectorizer.get_feature_names_out())

Top 5 Vocabulary: ['cat' 'dog' 'lay' 'on' 'the']


### **(c) Using N-grams**

In [6]:
# Use unigrams and bigrams
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(df['Sentences'])
print("Vocabulary with N-grams:", vectorizer.get_feature_names_out())

Vocabulary with N-grams: ['cat' 'cat chased' 'cat sat' 'chased' 'chased the' 'dog' 'dog lay' 'lay'
 'lay on' 'mat' 'on' 'on the' 'rug' 'sat' 'sat on' 'the' 'the cat'
 'the dog' 'the mat' 'the rug']


## **5. Advantages of Bag of Words**

1. **Simplicity**: Easy to implement and understand.
2. **Good for Text Classification**: Works well with algorithms like Naive Bayes, Logistic Regression, and SVM.
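3. **Sparse Representation**: Most entries in a BoW vector are zero, so the matrix can be stored compactly in a sparse format (`CountVectorizer` returns a SciPy sparse matrix).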
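
As an illustration of the text classification use case, here is a minimal sketch that feeds BoW counts into a Naive Bayes classifier. The training texts and labels are toy data invented for this example, so the result says nothing about real-world accuracy.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data, invented for illustration only.
train_texts = [
    "good great film",
    "great acting good plot",
    "bad boring film",
    "boring plot bad acting",
]
train_labels = ["pos", "pos", "neg", "neg"]

# The pipeline vectorizes the text into counts, then fits the classifier on them.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

predictions = model.predict(["good great plot", "bad boring acting"])
print(predictions)
```

Wrapping the vectorizer and classifier in one pipeline ensures that new text is encoded with the vocabulary learned during training.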


## **6. Limitations of Bag of Words**

1. **Ignores Context**: Word order and relationships between words are lost. With single-word features, "not happy" contributes one count for "not" and one for "happy", so the negation is not tied to the word it modifies.
2. **High Dimensionality**: For large vocabularies, the feature space becomes very large.
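3. **Sensitive to Vocabulary**: Results depend heavily on preprocessing choices (e.g., lowercasing, removing stop words), and words that were not seen when the vocabulary was built are ignored.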


## **7. Conclusion**

The **Bag of Words** model is a simple yet effective way to represent text data for many NLP tasks.
It has clear limitations, but its raw counts can be reweighted with TF-IDF, and it can be complemented or replaced by word embeddings in more demanding applications.

### **Next Steps:**
- Experiment with preprocessing techniques like stemming and lemmatization.
- Combine BoW with Term Frequency-Inverse Document Frequency (TF-IDF) for weighting important words.