# 📌 Topic: Bag of Words (BoW) Model

### What you will learn
- What the Bag of Words model is and how it represents text numerically
- How to use `CountVectorizer` to create document-term matrices
- How the vocabulary is built and how word frequencies are counted
- Limitations of the BoW approach

### Why this matters
Machine learning models can't process raw text directly. They need numbers. The **Bag of Words** model is one of the simplest and most common ways to convert text into numerical vectors that models can understand. It sets the stage for more advanced techniques like TF-IDF and word embeddings.

---

## What is Bag of Words?

The **Bag of Words** model represents text by ignoring grammar and word order, focusing only on the presence and frequency of words. Imagine throwing all the words from a document into a literal "bag": you know which words are inside and how many of each there are, but you've lost the structure of the sentences.

### How it works:
1. **Tokenization**: Breaking the text into individual words.
2. **Vocabulary Building**: Collecting every unique word across all documents in your dataset.
3. **Vectorization**: Representing each document as a row where each column corresponds to a word in the vocabulary, and the values are the counts of those words in the document.
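
To make the three steps concrete, here is a minimal from-scratch sketch in plain Python. It assumes a naive whitespace tokenizer and a made-up two-sentence corpus (`toy_docs`), neither of which comes from this lesson. `CountVectorizer` tokenizes more carefully, so treat this as an illustration of the idea rather than a replacement for it.

```python
from collections import Counter

# Hypothetical toy corpus, used only for this illustration
toy_docs = ["the cat sat", "the cat saw the dog"]

# 1. Tokenization: lowercase and split on whitespace (a deliberate simplification)
tokenized = [doc.lower().split() for doc in toy_docs]

# 2. Vocabulary building: every unique token across all documents, sorted
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# 3. Vectorization: one row per document, one count per vocabulary term
vectors = []
for tokens in tokenized:
    counts = Counter(tokens)
    vectors.append([counts[term] for term in vocabulary])

print(vocabulary)  # ['cat', 'dog', 'sat', 'saw', 'the']
print(vectors)     # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Each row has one entry per vocabulary term, so both documents end up with vectors of the same length even though the sentences differ in length.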

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Sample dataset of sentences for demonstration
data = [
    "I love learning natural language processing",
    "Natural language processing is fun to learn",
    "I enjoy learning new things in machine learning",
    "Machine learning helps computers understand language",
    "I love using language models for learning"
]
```

## Step 1: Initialize the CountVectorizer

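Scikit-learn provides `CountVectorizer` to handle the heavy lifting. It tokenizes the text, builds the vocabulary, and counts term frequencies in one step. With its default settings it also lowercases the text and keeps only tokens of two or more characters, so a one-letter word such as "I" never makes it into the vocabulary.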

```python
# Initialize the vectorizer
countvec = CountVectorizer()

# Fit and transform the data
# 'Fit' learns the vocabulary; 'Transform' creates the document-term matrix
countvec_fit = countvec.fit_transform(data)
```
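
Once fitted, the vectorizer exposes what it learned. The sketch below repeats the sample sentences under a new name (`sentences`) so that it runs on its own. `vectorizer` and `matrix` are illustrative names, while `vocabulary_` is the fitted attribute that maps each term to its column index.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same five sentences as in the sample dataset, repeated so this cell is self-contained
sentences = [
    "I love learning natural language processing",
    "Natural language processing is fun to learn",
    "I enjoy learning new things in machine learning",
    "Machine learning helps computers understand language",
    "I love using language models for learning",
]

vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(sentences)

# One row per sentence, one column per vocabulary term
print(matrix.shape)  # (5, 20)

# vocabulary_ is a dict of term -> column index; the default tokenizer drops "I"
print("learning" in vectorizer.vocabulary_)  # True
print("i" in vectorizer.vocabulary_)         # False
```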

## Step 2: Inspecting the Matrix

Let's convert the sparse matrix into a readable DataFrame so we can see which words appear where.

```python
# Create a DataFrame with column names matching the vocabulary terms
bag_of_words = pd.DataFrame(countvec_fit.toarray(), columns=countvec.get_feature_names_out())

# Let's add the original sentences for easier comparison
bag_of_words['ORIGINAL_SENTENCE'] = data

# Display the result
bag_of_words
```

## Key Takeaways

1. **Fixed Size**: Every document is represented by a vector of the same length (the size of the total vocabulary).
2. **Simple Counts**: Values indicate how many times a word appears in that specific document.
3. **Loss of Context**: Position and meaning (semantics) are discarded. "Dog bites man" and "Man bites dog" would look identical in basic BoW.
4. **Sparsity**: Most cells in the matrix will be zero, as most documents only use a small fraction of the total vocabulary.
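
Two of these points are easy to check for yourself. The sketch below uses the word-order example from takeaway 3 and a made-up three-document corpus (`unrelated`) whose documents share no words. All variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Loss of context: the same words in a different order give the same vector
pair = ["Dog bites man", "Man bites dog"]
pair_vectors = CountVectorizer().fit_transform(pair).toarray()
print(np.array_equal(pair_vectors[0], pair_vectors[1]))  # True

# Sparsity: documents with no words in common leave most cells at zero
unrelated = ["cats purr", "dogs bark", "birds sing"]
sparse_matrix = CountVectorizer().fit_transform(unrelated)
n_rows, n_cols = sparse_matrix.shape
zero_fraction = 1 - sparse_matrix.nnz / (n_rows * n_cols)
print(round(zero_fraction, 2))  # 0.67
```

On real corpora with thousands of vocabulary terms the fraction of zeros is typically far higher, which is why `CountVectorizer` returns a sparse matrix rather than a dense array.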

## Next steps:
- Explore **Stopword Removal** to exclude common words like "is", "to", "for" that don't add much meaning.
- Move on to **TF-IDF Vectorization** to see how we can weight words by importance rather than just raw frequency.