
# Bag of Words (BoW) in NLP
Bag of Words (BoW) is a simple text representation technique used in NLP. It converts text into numerical vectors of word counts so that machine learning models can process it.



## Understanding Bag of Words
Concept
* BoW represents text data as a collection (bag) of words without considering grammar or word order.

* It counts the occurrence of words and constructs a feature vector.

* It helps in text classification, sentiment analysis, spam detection, etc.
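
As a minimal sketch of the "bag" idea, the snippet below counts words with Python's `collections.Counter`. The helper name `bag_of_words` and the two example sentences are made up for illustration, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter

def bag_of_words(text):
    """Return a word -> count mapping; word order is discarded."""
    return Counter(text.lower().split())

a = bag_of_words("the dog chased the cat")
b = bag_of_words("the cat chased the dog")

print(a["the"])  # 2: "the" occurs twice in the sentence
print(a == b)    # True: same words in a different order give the same bag
```

The two sentences mean different things, yet their bags are identical, which is exactly what "without considering grammar or word order" implies.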

## Steps to Perform BoW
1. Tokenization: Split the text into words.

2. Lowercasing: Convert all words to lowercase for uniformity.

3. Remove Stopwords: Remove common words like "the", "is", "and", etc. Doing this after lowercasing ensures that "The" is matched as well.

4. Stemming/Lemmatization (Optional): Reduce words to their root forms.

5. Build Vocabulary: Create a unique list of words from all documents.

6. Vectorization: Convert text into numerical form by counting word occurrences.
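
The steps above can be sketched by hand in plain Python. This is an illustrative sketch, not how CountVectorizer is implemented: the `STOPWORDS` set is a tiny made-up list, the `preprocess` helper is hypothetical, and the optional stemming/lemmatization step is skipped.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "is", "and", "a", "in", "are"}

def preprocess(text):
    tokens = re.findall(r"[A-Za-z]+", text)              # tokenization
    tokens = [t.lower() for t in tokens]                 # lowercasing
    return [t for t in tokens if t not in STOPWORDS]     # stopword removal

docs = [
    "Natural language processing is amazing",
    "Language models are powerful in NLP",
]

tokenized = [preprocess(d) for d in docs]

# Build vocabulary: the sorted set of unique words across all documents.
vocabulary = sorted({t for doc in tokenized for t in doc})

# Vectorization: one count per vocabulary word, per document.
vectors = [[Counter(doc)[w] for w in vocabulary] for doc in tokenized]

print(vocabulary)
for v in vectors:
    print(v)
```

Each vector has one entry per vocabulary word, so both documents end up with the same length regardless of how many words they contain.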



# Performing BoW in Python using a Pandas DataFrame

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer


In [2]:
data = {'Text': [
    'Natural language processing is amazing',
    'Language models are powerful in NLP',
    'Processing text data is a crucial step in NLP',
    'NLP helps machines understand human language'
]}

df = pd.DataFrame(data)
df


|   | Text |
|---|------|
| 0 | Natural language processing is amazing |
| 1 | Language models are powerful in NLP |
| 2 | Processing text data is a crucial step in NLP |
| 3 | NLP helps machines understand human language |


## Apply Bag of Words using CountVectorizer

In [3]:
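# Learn the vocabulary from the 'Text' column and count word occurrences per row
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['Text'])  # sparse matrix of shape (4, 18)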




In [4]:
# Convert BoW representation into DataFrame for better understanding
bow_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())



In [6]:
print("\nBag of Words Representation:")
bow_df


Bag of Words Representation:


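|   | amazing | are | crucial | data | helps | human | in | is | language | machines | models | natural | nlp | powerful | processing | step | text | understand |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 |
| 3 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |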


## Explanation of Output
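1. Vocabulary Construction
CountVectorizer lowercases the text, extracts the unique words from all sentences, and sorts them alphabetically to form the vocabulary (the 18 columns in the output). With its default settings it keeps only tokens of two or more characters, so "a" is dropped, and it does not remove stopwords, which is why "is", "are", and "in" still appear as columns. Passing stop_words='english' filters them out.

2. Vector Representation
Each row in bow_df corresponds to a sentence, and each column represents a word from the vocabulary.

The value at a specific row-column position is the number of times that word appears in that sentence. For example, row 1 has a 1 under "models" and a 0 under "amazing". Every count here is 0 or 1 because no sentence repeats a word.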

## Use Cases of Bag of Words
1. Text Classification (Spam Detection, Sentiment Analysis)

2. Information Retrieval (Search Engines)

3. Topic Modeling (Grouping similar documents)

4. Recommendation Systems (User reviews, feedback analysis)
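
As one illustration of the information retrieval use case, the sketch below ranks the four example sentences against a query by cosine similarity of their word-count vectors. The helpers `bow` and `cosine` and the query string are hypothetical, and whitespace splitting is a simplification of real tokenization.

```python
import math
from collections import Counter

def bow(text):
    """Word -> count mapping using simple lowercase whitespace tokenization."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two bags of words."""
    dot = sum(u[w] * v[w] for w in u)  # Counter returns 0 for missing words
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

docs = [
    "Natural language processing is amazing",
    "Language models are powerful in NLP",
    "Processing text data is a crucial step in NLP",
    "NLP helps machines understand human language",
]

query = bow("language models")
scores = [cosine(query, bow(d)) for d in docs]

# Document indices ordered from most to least similar to the query.
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)  # [1, 0, 3, 2]
```

Document 1 ranks first because it contains both query words, while document 2 shares none and scores zero.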

## Limitations of BoW
1. Ignores Context: Discards word order and grammar, so it doesn't capture meaning or relationships between words.

2. Sparse Representation: A large vocabulary leads to high-dimensional vectors in which most entries are zero.

3. Doesn't Handle Synonyms: Different words with similar meanings are treated as unrelated features.
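
Two of these limitations are easy to see in a small sketch. The `bow` helper and the two synonym phrases are made up for illustration, and the sparsity figure is computed with plain whitespace tokenization rather than CountVectorizer.

```python
from collections import Counter

def bow(text):
    """Word -> count mapping using simple lowercase whitespace tokenization."""
    return Counter(text.lower().split())

# Synonyms: similar meaning, but the bags share no words at all.
a = bow("great film")
b = bow("excellent movie")
overlap = set(a) & set(b)
print(overlap)  # set()

# Sparsity: most cells of the document-term matrix are zero.
docs = [
    "Natural language processing is amazing",
    "Language models are powerful in NLP",
    "Processing text data is a crucial step in NLP",
    "NLP helps machines understand human language",
]
bags = [bow(d) for d in docs]
vocab = sorted(set().union(*bags))
matrix = [[bag[w] for w in vocab] for bag in bags]

zeros = sum(cell == 0 for row in matrix for cell in row)
sparsity = zeros / (len(matrix) * len(vocab))
print(f"{sparsity:.0%} of cells are zero")
```

Even with four short sentences, well over half of the matrix is zeros, and the proportion grows as the vocabulary gets larger.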