Bag of Words is a basic text representation technique in NLP.

It converts text into numbers (vectors) so that machine learning models can work with it.

The main idea:

Collect all the unique words across the documents; this set is the vocabulary.

Represent each document (sentence, paragraph, etc.) as a vector of word counts, with one entry per vocabulary word.

It’s called a “bag” because:

We only care about which words appear and how many times.

We do not care about word order or grammar.
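
To make the "bag" idea concrete, here is a minimal sketch using Python's `collections.Counter`. The two sentences and the `bag` helper are illustrative assumptions, not part of the example that follows, and splitting on whitespace stands in for a real tokenizer.

```python
from collections import Counter

def bag(text):
    """Count lowercase, whitespace-separated words (a deliberately naive tokenizer)."""
    return Counter(text.lower().split())

# Hypothetical sentences: same words, different order
a = "the dog chased the cat"
b = "the cat chased the dog"

print(bag(a))
print(bag(a) == bag(b))  # True: the bags are identical even though the meaning differs
```

Because only the counts are stored, the two sentences become indistinguishable once they are turned into bags.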

In [1]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string

# Sample documents
doc1 = "I love Natural Language Processing."
doc2 = "Natural Language Processing loves Python."
documents = [doc1, doc2]

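# Tokenize and preprocess
# Download the tokenizer model and the stopword list (stopwords.words below needs it)
nltk.download('punkt')  # newer NLTK releases may need 'punkt_tab' instead
nltk.download('stopwords', quiet=True)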

stop_words = set(stopwords.words('english'))
punctuations = set(string.punctuation)

tokenized_docs = []
for doc in documents:
    words = word_tokenize(doc.lower())  # lowercase + tokenize
    words = [w for w in words if w not in stop_words and w not in punctuations]  # remove stopwords & punctuation
    tokenized_docs.append(words)

print("Tokenized Documents:", tokenized_docs)


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Hp\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Tokenized Documents: [['love', 'natural', 'language', 'processing'], ['natural', 'language', 'processing', 'loves', 'python']]


In [2]:
# Build a vocabulary (unique words)
vocabulary = sorted({word for doc in tokenized_docs for word in doc})  # sorted for a stable column order
print("Vocabulary:", vocabulary)


Vocabulary: ['language', 'love', 'loves', 'natural', 'processing', 'python']
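
Each word's position in the sorted vocabulary becomes its column in the document vectors. A small sketch of that mapping, assuming the vocabulary printed above; the name `word_to_index` is introduced here for illustration only.

```python
# Vocabulary copied from the output above
vocabulary = ['language', 'love', 'loves', 'natural', 'processing', 'python']

# Map each word to its column position; a dict lookup avoids scanning the list
word_to_index = {word: i for i, word in enumerate(vocabulary)}

print(word_to_index['natural'])   # 3
print(word_to_index.get('java'))  # None: 'java' is not in the vocabulary
```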


In [3]:
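# Function to create BoW vector
def bow_vectorize(doc, vocabulary):
    """Return the word counts of a tokenized document, one entry per vocabulary word."""
    vector = [0] * len(vocabulary)
    for word in doc:
        if word in vocabulary:  # words outside the vocabulary are ignored
            index = vocabulary.index(word)  # column that belongs to this word
            vector[index] += 1
    return vector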

# Apply to all documents
bow_matrix = [bow_vectorize(doc, vocabulary) for doc in tokenized_docs]

print("BoW Matrix:")
for vec in bow_matrix:
    print(vec)


BoW Matrix:
[1, 1, 0, 1, 1, 0]
[1, 0, 1, 1, 1, 1]


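Vocabulary = ['language', 'love', 'loves', 'natural', 'processing', 'python']

Each position in a vector holds the count of the vocabulary word at the same position.

First document (after preprocessing) → "love natural language processing" → [1, 1, 0, 1, 1, 0]

Second document (after preprocessing) → "natural language processing loves python" → [1, 0, 1, 1, 1, 1]

The word "I" from the first document is missing because it was removed as a stopword, and "love" and "loves" get separate columns because the preprocessing does not reduce words to a common root.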
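
The same vocabulary can be reused for text that was not in the original collection. The sketch below is an illustration under stated assumptions: `new_doc` is a made-up, already-tokenized document, and `bow_vectorize_counter` is a hypothetical variant of the function above built on `collections.Counter`. Words outside the vocabulary are simply dropped, which mirrors the `if word in vocabulary` check.

```python
from collections import Counter

# Vocabulary copied from the notebook output
vocabulary = ['language', 'love', 'loves', 'natural', 'processing', 'python']

def bow_vectorize_counter(tokens, vocabulary):
    """Return one count per vocabulary word; out-of-vocabulary tokens are ignored."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

# Hypothetical new document, already lowercased and tokenized
new_doc = ['python', 'loves', 'python', 'developers']

vector = bow_vectorize_counter(new_doc, vocabulary)
print(vector)  # [0, 0, 1, 0, 0, 2]

# Pair each count with its word to make the vector easier to read
print(dict(zip(vocabulary, vector)))
```

Here 'developers' leaves no trace in the vector, so a fixed vocabulary cannot represent words it has never seen.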