
# Bag of Words (BoW) in NLP
The Bag of Words (BoW) model represents text as numerical features. It counts how often each vocabulary word occurs in a document, ignoring word order, and stacks those counts into a feature matrix with one row per document. Unlike One-Hot Encoding, which only records whether a word is present, BoW captures word frequency, which makes it useful for text classification and other NLP tasks.

## Step-by-Step Implementation using DataFrame (DF)
We will:

1. Define a text corpus
2. Tokenize and preprocess the text
3. Create a vocabulary
4. Generate a BoW matrix using Pandas DataFrame for better visualization
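
Before turning to scikit-learn, the first three steps can be sketched by hand. This is a minimal illustration in plain Python, assuming whitespace tokenization and a small hypothetical corpus (`docs`); it is not how `CountVectorizer` is implemented internally.

```python
from collections import Counter

# Hypothetical toy corpus, used only for this sketch
docs = ["NLP is fun", "NLP is NLP"]

# Tokenize and preprocess: lowercase, then split on whitespace
tokenized = [doc.lower().split() for doc in docs]

# Create a vocabulary: the sorted set of all distinct tokens
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# Count each vocabulary word per document to build the BoW rows
bow_rows = []
for tokens in tokenized:
    counts = Counter(tokens)
    bow_rows.append([counts[word] for word in vocabulary])

print(vocabulary)  # ['fun', 'is', 'nlp']
print(bow_rows)    # [[1, 1, 1], [0, 1, 2]]
```

Each inner list is one document, and each position corresponds to one vocabulary word, which is the same layout the DataFrame shows later.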

## Step 1: Import the Required Libraries

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer


## Step 2: Define a Sample Corpus

In [2]:
corpus = [
    "I love NLP and Machine Learning",
    "NLP is amazing and fun",
    "Machine Learning is powerful"
]


### 📌 Explanation:

We have 3 documents (sentences) in our corpus.
Our goal is to convert these sentences into a Bag of Words representation.

## Step 3: Apply Bag of Words using CountVectorizer

In [3]:
vectorizer = CountVectorizer()  # Initialize the BoW model
X = vectorizer.fit_transform(corpus)  # Transform text into a BoW matrix

In [7]:
vectorizer.get_feature_names_out()

array(['amazing', 'and', 'fun', 'is', 'learning', 'love', 'machine',
       'nlp', 'powerful'], dtype=object)

In [9]:
# Just for illustration: words outside the fitted vocabulary ("my", "name") are ignored
vectorizer.transform(['my name is amazing']).toarray()

array([[1, 0, 0, 1, 0, 0, 0, 0, 0]])

## Step 4: Convert to DataFrame for Better Understanding

In [4]:
vocab = vectorizer.get_feature_names_out()
df = pd.DataFrame(X.toarray(), columns=vocab)

df

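| | amazing | and | fun | is | learning | love | machine | nlp | powerful |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 |
| 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 |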


## Understanding the Output
* Rows represent documents (sentences).
* Columns represent words in the vocabulary.
* Each cell contains the frequency of a word in the corresponding document.
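
To make the row/column reading concrete, here is a small sketch that looks up a single cell. It rebuilds the same corpus so that it runs on its own; the variable `col` is just an illustrative name.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "I love NLP and Machine Learning",
    "NLP is amazing and fun",
    "Machine Learning is powerful",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # SciPy sparse matrix of counts

print(X.shape)  # (3, 9): 3 documents, 9 vocabulary words

# vocabulary_ maps each word to its column index
col = vectorizer.vocabulary_["nlp"]
print(col)        # 7
print(X[1, col])  # 1: "nlp" occurs once in document 1
```
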

In [6]:
vocab = vectorizer.get_feature_names_out()
print("Vocabulary:", vocab)


Vocabulary: ['amazing' 'and' 'fun' 'is' 'learning' 'love' 'machine' 'nlp' 'powerful']


# **Commonly Used Hyperparameters in CountVectorizer (Scikit-Learn)**

## 1. ngram_range (Controlling N-Grams)
🔹 This parameter defines the range of n-grams (word sequences) to consider.

* ngram_range=(min_n, max_n)
* (1,1): Only unigrams (single words).
* (1,2): Unigrams + bigrams (single words and two-word phrases).
* (2,2): Only bigrams.
* (2,3): Bigrams + trigrams.
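
As a rough illustration of what these ranges mean, the sketch below builds n-grams from an already tokenized sentence. `word_ngrams` is a hypothetical helper written for this example, not a scikit-learn function.

```python
def word_ngrams(tokens, min_n, max_n):
    """Return all contiguous n-grams with min_n <= n <= max_n, joined by spaces."""
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the sentence
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams

tokens = ["nlp", "is", "amazing"]
print(word_ngrams(tokens, 1, 2))
# ['nlp', 'is', 'amazing', 'nlp is', 'is amazing']
print(word_ngrams(tokens, 2, 3))
# ['nlp is', 'is amazing', 'nlp is amazing']
```
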

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["I love NLP", "NLP is amazing"]

vectorizer = CountVectorizer(ngram_range=(1,2))
X = vectorizer.fit_transform(corpus)

print("unigrams plus bigrams:", vectorizer.get_feature_names_out())


unigrams plus bigrams: ['amazing' 'is' 'is amazing' 'love' 'love nlp' 'nlp' 'nlp is']


## 2. stop_words (Removing Common Words)
🔹 Removes common words (like "is", "the", "and") that do not add much meaning.

In [3]:
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())


['amazing' 'love' 'nlp']


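* "is" is removed because it is in the built-in English stop-word list.
* "I" would not appear even without stop_words: the default token pattern only keeps tokens of two or more characters.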

## 3. max_features (Limiting Vocabulary Size)

**max_features=N**
* Keeps only the N terms with the highest total count across the corpus.
* Useful when dealing with large vocabularies.

In [4]:
vectorizer = CountVectorizer(max_features=3)
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())


['amazing' 'is' 'nlp']


📌 Explanation:

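The 3 terms with the highest total count across the corpus are retained. "nlp" appears twice, while "amazing", "is", and "love" each appear once, so the tie for the remaining two slots is broken by the library's internal ordering rather than by meaning, and "love" is dropped here.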

## 4. min_df & max_df (Filtering Words by Frequency)
* 🔹 min_df removes rare words (those appearing in too few documents).
* 🔹 max_df removes very frequent words (those appearing in too many documents).

* An integer value is an absolute number of documents; a float between 0.0 and 1.0 is a proportion of the documents.
* min_df=N: removes words that appear in fewer than N documents.
* max_df=N: removes words that appear in more than N documents.
* min_df=2: keeps words appearing in at least 2 documents.
* max_df=0.8: removes words appearing in more than 80% of the documents.

In [6]:
corpus = ["AI is great", "I love AI", "AI is the future", "Machine learning is AI"]

vectorizer = CountVectorizer(min_df=2)
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())


['ai' 'is']


Words like "great", "love", "future" are removed as they appear in only 1 document.

## 5. binary (Presence or Absence Instead of Count)
Useful for word presence detection instead of frequency-based analysis.

In [8]:
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(corpus)

print(X.toarray())

[[1 0 1 1 0 0 0 0]
 [1 0 0 0 0 1 0 0]
 [1 1 0 1 0 0 0 1]
 [1 0 0 1 1 0 1 0]]


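Each word gets 1 if present and 0 if absent, rather than its actual count. No word is repeated within a document in this corpus, so the matrix happens to be identical to the plain count matrix.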
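
The effect of binary=True only becomes visible when a word occurs more than once in a document. A minimal sketch with a deliberately repetitive, made-up sentence:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["NLP NLP NLP is fun"]  # "nlp" is repeated on purpose

counts = CountVectorizer().fit_transform(docs).toarray().tolist()
flags = CountVectorizer(binary=True).fit_transform(docs).toarray().tolist()

# Columns are in vocabulary order: fun, is, nlp
print(counts)  # [[1, 1, 3]]
print(flags)   # [[1, 1, 1]]
```
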

## 6. lowercase (Convert Text to Lowercase)

🔹 Converts all text to lowercase before tokenizing. This is the default (lowercase=True); with lowercase=False, "AI" and "ai" are treated as different words.

In [10]:
corpus = ["AI is Powerful", "ai is the future"]

vectorizer = CountVectorizer(lowercase=False)
X = vectorizer.fit_transform(corpus)

print("original data",vectorizer.get_feature_names_out())



corpus = ["AI is Powerful", "ai is the future"]

vectorizer = CountVectorizer(lowercase=True)
X = vectorizer.fit_transform(corpus)

print("after lowercasing",vectorizer.get_feature_names_out())

original data ['AI' 'Powerful' 'ai' 'future' 'is' 'the']
after lowercasing ['ai' 'future' 'is' 'powerful' 'the']


## 7. token_pattern (Customize Tokenization)
* token_pattern=r'(?u)\b\w\w+\b'  # Default: tokens of 2 or more word characters
* \b\w{3,}\b → Extracts tokens with at least 3 word characters (letters, digits, or underscores).
* \b[A-Za-z]+\b → Extracts only tokens made up entirely of ASCII letters.

In [11]:
vectorizer = CountVectorizer(token_pattern=r'\b\w{3,}\b')  # Words with at least 3 letters
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())


['future' 'powerful' 'the']


📌 Explanation:

* Removes short words (fewer than 3 characters), so "ai" and "is" are dropped from the vocabulary.
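
To inspect what a given pattern produces without fitting anything, one option is `build_analyzer()`, which returns the preprocessing and tokenizing function the vectorizer would use. The sketch below compares the default pattern with one that also accepts one-character tokens; the sample sentence is made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = "I love NLP a lot"

# Default pattern: one-character tokens such as "I" and "a" are skipped
default_tokens = CountVectorizer().build_analyzer()(text)
print(default_tokens)  # ['love', 'nlp', 'lot']

# A pattern that also accepts one-character tokens
all_tokens = CountVectorizer(token_pattern=r"(?u)\b\w+\b").build_analyzer()(text)
print(all_tokens)  # ['i', 'love', 'nlp', 'a', 'lot']
```
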