<a href="https://colab.research.google.com/github/Ehtisham1053/Natural-Language-Processing/blob/main/Bag_of_words.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Bag of Words (BoW) in NLP
The Bag of Words (BoW) model represents text as numerical features. It counts how often each vocabulary word occurs in each document, ignoring word order, and arranges the counts in a feature matrix with one row per document. Unlike a binary one-hot (presence/absence) encoding, BoW records word frequency, which makes it a common baseline for text classification and other NLP tasks.

## Step-by-Step Implementation using DataFrame (DF)
We will:

1. Define a text corpus
2. Tokenize and preprocess the text
3. Create a vocabulary
4. Generate a BoW matrix using Pandas DataFrame for better visualization
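To make the four steps concrete before handing them to a library, here is a minimal pure-Python sketch. It uses the same three sentences as the sample corpus in this notebook. The `tokenize` helper is hypothetical, and its rule (lowercase, keep tokens of two or more word characters) is an assumption chosen to resemble `CountVectorizer`'s defaults, not a reimplementation of it.

```python
import re
from collections import Counter

# Step 1: the same three sentences used as the sample corpus in this notebook.
corpus = [
    "I love NLP and Machine Learning",
    "NLP is amazing and fun",
    "Machine Learning is powerful",
]

def tokenize(text):
    # Hypothetical helper: lowercase, then keep runs of 2+ word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

# Step 2: tokenize and preprocess each document.
tokenized = [tokenize(doc) for doc in corpus]

# Step 3: the vocabulary is the sorted set of all tokens.
vocabulary = sorted({token for doc in tokenized for token in doc})

# Step 4: one row of word counts per document, one column per vocabulary word.
bow_matrix = []
for doc in tokenized:
    counts = Counter(doc)
    bow_matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in bow_matrix:
    print(row)
```

The rest of the notebook gets the same kind of matrix from `CountVectorizer`, which bundles the tokenizing, vocabulary building, and counting into one object.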

## Step 1: Import the Libraries

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer


## Step 2: Define a Sample Corpus

In [2]:
corpus = [
    "I love NLP and Machine Learning",
    "NLP is amazing and fun",
    "Machine Learning is powerful"
]


### 📌 Explanation:

We have 3 documents (sentences) in our corpus.
Our goal is to convert these sentences into a Bag of Words representation.

## Step 3: Apply Bag of Words using CountVectorizer

In [3]:
vectorizer = CountVectorizer()  # Initialize the BoW model (lowercases and tokenizes by default)
X = vectorizer.fit_transform(corpus)  # Learn the vocabulary and return a sparse document-term count matrix

In [7]:
vectorizer.get_feature_names_out()

array(['amazing', 'and', 'fun', 'is', 'learning', 'love', 'machine',
       'nlp', 'powerful'], dtype=object)

In [9]:
# Just for the sake of the example: words not in the learned vocabulary ("my", "name") are ignored
vectorizer.transform(['my name is amazing']).toarray()

array([[1, 0, 0, 1, 0, 0, 0, 0, 0]])
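Only the columns for "amazing" and "is" are set, because "my" and "name" were never seen during fitting and therefore have no column. A rough pure-Python sketch of that lookup follows. The `bow_vector` helper is hypothetical, and the vocabulary is copied from the output above.

```python
import re

# Vocabulary learned from the sample corpus (copied from the output above).
vocabulary = ['amazing', 'and', 'fun', 'is', 'learning',
              'love', 'machine', 'nlp', 'powerful']

def bow_vector(text, vocabulary):
    # Map each known word to its column index.
    index = {word: i for i, word in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for token in re.findall(r"\b\w\w+\b", text.lower()):
        if token in index:  # unseen words have no column and are skipped
            vector[index[token]] += 1
    return vector

print(bow_vector("my name is amazing", vocabulary))
```

One consequence is that a sentence made up entirely of unseen words maps to an all-zero vector.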

## Step 4: Convert to DataFrame for Better Understanding

In [4]:
vocab = vectorizer.get_feature_names_out()
df = pd.DataFrame(X.toarray(), columns=vocab)

df

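|   | amazing | and | fun | is | learning | love | machine | nlp | powerful |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 |
| 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 |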


## Understanding the Output
* Rows represent documents (sentences).
* Columns represent words in the vocabulary.
* Each cell contains the frequency of a word in the corresponding document.
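In the sample corpus no word repeats within a sentence, so every cell happens to be 0 or 1. The small sketch below uses a made-up sentence (not part of the corpus) with repeated words to show how frequency counts differ from a binary presence flag.

```python
from collections import Counter

# Hypothetical sentence with repeated words (not part of the sample corpus).
sentence = "nlp is fun and nlp is amazing"
tokens = sentence.split()

counts = Counter(tokens)                 # BoW: how many times each word occurs
presence = {word: 1 for word in counts}  # binary: whether each word occurs at all

print(counts["nlp"], presence["nlp"])
print(counts["fun"], presence["fun"])
```

Here "nlp" and "is" each get a count of 2 under BoW, while a presence-only encoding would record 1 for both.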

In [6]:
vocab = vectorizer.get_feature_names_out()
print("Vocabulary:", vocab)


Vocabulary: ['amazing' 'and' 'fun' 'is' 'learning' 'love' 'machine' 'nlp' 'powerful']
