
Bag of Words (BoW) is an NLP model that turns one or more sentences into a collection of words and keeps track of how often each word occurs in each sentence. This way, text is converted into numbers that a machine learning model can process.
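As a quick illustration of the "bag" idea, here is a minimal sketch that counts the words of a single sentence with Python's `collections.Counter`. The helper name `bag_of_words` and the two example sentences are made up for illustration, and the sketch assumes simple lowercasing and whitespace splitting.

```python
from collections import Counter

def bag_of_words(sentence):
    """Count how often each lowercased, whitespace-separated word occurs."""
    return Counter(sentence.lower().split())

bag_a = bag_of_words("the dog chased the cat")
bag_b = bag_of_words("the cat chased the dog")

print(bag_a["the"])    # frequency of the word "the"
print(bag_a == bag_b)  # word order is discarded, so the two bags compare equal
```

Only the counts survive, which is why two sentences with the same words in a different order end up with the same representation.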

In [1]:
# Our small corpus (collection of documents)
documents = [
    "I love machine learning",
    "Machine learning is great",
    "I love coding in Python"
]
print("Our Corpus:")
for i, doc in enumerate(documents):
    print(f"Doc {i+1}: {doc}")

# Step 1: Create the vocabulary (unique words)
vocabulary = set()
for doc in documents:
    for word in doc.lower().split(): # Simple split for now
        vocabulary.add(word)
vocabulary = sorted(vocabulary) # Sort for consistency (sorted() already returns a list)
print(f"\nStep 1 - Vocabulary ({len(vocabulary)} words): {vocabulary}")

# Step 2: Create BoW vectors manually
print("\nStep 2 - Manual Bag-of-Words Vectors:")
for doc in documents:
    # Create a vector of zeros with same length as vocabulary
    vector = [0] * len(vocabulary)
    words = doc.lower().split()
    for word in words:
        # Find the index of this word in vocabulary and increment count
        index = vocabulary.index(word)
        vector[index] += 1
    print(f"'{doc}' -> {vector}")

Our Corpus:
Doc 1: I love machine learning
Doc 2: Machine learning is great
Doc 3: I love coding in Python

Step 1 - Vocabulary (9 words): ['coding', 'great', 'i', 'in', 'is', 'learning', 'love', 'machine', 'python']

Step 2 - Manual Bag-of-Words Vectors:
'I love machine learning' -> [0, 0, 1, 0, 0, 1, 1, 1, 0]
'Machine learning is great' -> [0, 1, 0, 0, 1, 1, 0, 1, 0]
'I love coding in Python' -> [1, 0, 1, 1, 0, 0, 1, 0, 1]
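The manual steps above can also be wrapped into reusable functions. The sketch below is one possible way to do that; the names `build_vocabulary` and `vectorize` are hypothetical, and skipping words that are not in the vocabulary is an assumption made here (the loop above would raise a `ValueError` from `vocabulary.index` for an unseen word).

```python
def build_vocabulary(docs):
    """Return the sorted list of unique lowercased words across all documents."""
    return sorted({word for doc in docs for word in doc.lower().split()})

def vectorize(doc, vocabulary):
    """Turn one document into a list of word counts aligned with `vocabulary`."""
    # Build a word -> column lookup once per document instead of list.index() per word
    word_to_index = {word: i for i, word in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for word in doc.lower().split():
        index = word_to_index.get(word)
        if index is not None:  # assumption: silently skip out-of-vocabulary words
            vector[index] += 1
    return vector

documents = [
    "I love machine learning",
    "Machine learning is great",
    "I love coding in Python",
]
vocabulary = build_vocabulary(documents)

# "deep" is not in the vocabulary, so it is ignored instead of raising ValueError
new_vector = vectorize("I love deep learning", vocabulary)
print(new_vector)
```

Ignoring unseen words keeps every vector the same length as the vocabulary, which is what a downstream model expects.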


In [2]:
# First, install scikit-learn if you haven't
!pip install scikit-learn -q
print("sklearn installed/verified.")

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd # For nice table display

# Re-create our simple corpus
documents = [
    "I love machine learning",
    "Machine learning is great",
    "I love coding in Python"
]

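# Create and fit the vectorizer
# Note: by default CountVectorizer lowercases the text and keeps only tokens
# of two or more characters, so the one-letter word "i" is dropped
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents) # 'X' is the conventional name for the feature matrix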

# Investigate the results
print("\n--- Professional Bag-of-Words with scikit-learn ---")
print("\n1. Vocabulary (Feature Names):")
print(vectorizer.get_feature_names_out())

print("\n2. Dense Matrix Representation (Documents x Words):")
# Convert from sparse matrix to dense array for display
dense_array = X.toarray()
print(dense_array)

print("\n3. As a Readable DataFrame:")
df_bow = pd.DataFrame(dense_array,
                     columns=vectorizer.get_feature_names_out(),
                     index=[f"Doc {i+1}" for i in range(len(documents))])
print(df_bow)

sklearn installed/verified.

--- Professional Bag-of-Words with scikit-learn ---

1. Vocabulary (Feature Names):
['coding' 'great' 'in' 'is' 'learning' 'love' 'machine' 'python']

2. Dense Matrix Representation (Documents x Words):
[[0 0 0 0 1 1 1 0]
 [0 1 0 1 1 0 1 0]
 [1 0 1 0 0 1 0 1]]

3. As a Readable DataFrame:
       coding  great  in  is  learning  love  machine  python
Doc 1       0      0   0   0         1     1        1       0
Doc 2       0      1   0   1         1     0        1       0
Doc 3       1      0   1   0         0     1        0       1
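Note that 'i' is missing from this vocabulary even though the manual version kept it. The default `token_pattern` of `CountVectorizer` is `(?u)\b\w\w+\b`, which only matches tokens of two or more word characters. The sketch below applies that pattern directly with `re` to show the effect. The second pattern, named `keep_single_chars` here, is an illustrative alternative that one could pass as `token_pattern` to keep one-letter words.

```python
import re

# Default token_pattern used by scikit-learn's CountVectorizer
default_pattern = r"(?u)\b\w\w+\b"
# Illustrative alternative that also matches one-character tokens
keep_single_chars = r"(?u)\b\w+\b"

# CountVectorizer lowercases text by default before tokenizing
text = "I love machine learning".lower()

default_tokens = re.findall(default_pattern, text)
all_tokens = re.findall(keep_single_chars, text)

print(default_tokens)  # 'i' is dropped
print(all_tokens)      # 'i' is kept
```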


In [4]:
print("\n--- Enhanced BoW with Our NLP Tools ---")
# We can customize CountVectorizer to use our own tokenizer and stop words
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Ensure NLTK data is available
import nltk
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('punkt_tab')

# Build the stopword set once, rather than on every tokenizer call
stop_words = set(stopwords.words('english'))

# Define a custom tokenizer function
def custom_tokenizer(text):
    # 1. Tokenize using NLTK
    tokens = word_tokenize(text.lower())
    # 2. Remove stopwords and single-character tokens
    return [w for w in tokens if w not in stop_words and len(w) > 1]

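# Create vectorizer with our custom tokenizer
# token_pattern=None avoids the warning that the default pattern is unused when a tokenizer is given
enhanced_vectorizer = CountVectorizer(tokenizer=custom_tokenizer, token_pattern=None)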
X_enhanced = enhanced_vectorizer.fit_transform(documents)

print("Vocabulary (after removing stopwords like 'i', 'is', 'in'):")
print(enhanced_vectorizer.get_feature_names_out())

df_enhanced = pd.DataFrame(X_enhanced.toarray(),
                           columns=enhanced_vectorizer.get_feature_names_out(),
                           index=[f"Doc {i+1}" for i in range(len(documents))])
print("\nEnhanced BoW DataFrame:")
print(df_enhanced)


--- Enhanced BoW with Our NLP Tools ---


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Vocabulary (after removing stopwords like 'i', 'is', 'in'):
['coding' 'great' 'learning' 'love' 'machine' 'python']

Enhanced BoW DataFrame:
       coding  great  learning  love  machine  python
Doc 1       0      0         1     1        1       0
Doc 2       0      1         1     0        1       0
Doc 3       1      0         0     1        0       1
