1. Bag-of-Words (BoW) Method
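Bag of Words (BoW) is a simple and flexible Natural Language Processing technique for turning text into numerical features: each document is represented by the counts of the terms it contains, while grammar and word order are disregarded. Scikit-learn's CountVectorizer builds this representation directly. In the example below, ngram_range=(1,3) extracts unigrams, bigrams, and trigrams, so the multi-word terms retain some local word order. By default, CountVectorizer also lowercases the text and drops single-character tokens, which is why "I" does not appear in the vocabulary.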

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

# Extract unigrams, bigrams, and trigrams
vectorizer = CountVectorizer(ngram_range=(1, 3))
sample_text = ["I am Aftab Mallick",
               "I am Interested in learning NLP",
               "I know Machine Learning"]
# Learn the vocabulary and build the document-term count matrix (sparse)
x = vectorizer.fit_transform(sample_text)
print(f"Vocabulary: {vectorizer.vocabulary_}")
print(f"Feature Names: {vectorizer.get_feature_names_out()}")
print(f"Document terms: \n{x.toarray()}")

Vocabulary: {'am': 2, 'aftab': 0, 'mallick': 20, 'am aftab': 3, 'aftab mallick': 1, 'am aftab mallick': 4, 'interested': 10, 'in': 7, 'learning': 16, 'nlp': 21, 'am interested': 5, 'interested in': 11, 'in learning': 8, 'learning nlp': 17, 'am interested in': 6, 'interested in learning': 12, 'in learning nlp': 9, 'know': 13, 'machine': 18, 'know machine': 14, 'machine learning': 19, 'know machine learning': 15}
Feature Names: ['aftab' 'aftab mallick' 'am' 'am aftab' 'am aftab mallick'
 'am interested' 'am interested in' 'in' 'in learning' 'in learning nlp'
 'interested' 'interested in' 'interested in learning' 'know'
 'know machine' 'know machine learning' 'learning' 'learning nlp'
 'machine' 'machine learning' 'mallick' 'nlp']
Document terms: 
[[1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0]
 [0 0 1 0 0 1 1 1 1 1 1 1 1 0 0 0 1 1 0 0 0 1]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 1 1 0 0]]
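
To see where these 22 features come from, here is a minimal sketch of n-gram extraction using only the standard library. It assumes the tokenization rule documented as CountVectorizer's default (lowercase the text, keep tokens of two or more word characters); the helper name ngrams is hypothetical and is not part of Scikit-learn.

```python
import re

# Assumption: mirrors CountVectorizer's documented default token pattern,
# which keeps only tokens of two or more word characters (so "I" is dropped).
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def ngrams(text, min_n=1, max_n=3):
    """Return the lowercased word n-grams of text for min_n <= n <= max_n."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the token list
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


first_doc_ngrams = ngrams("I am Aftab Mallick")
print(first_doc_ngrams)
```

The first document yields three unigrams, two bigrams, and one trigram, which matches the six non-zero entries in the first row of the matrix above.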


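Basic manual implementation of Bag of Words

This version differs from CountVectorizer in three ways: it splits on whitespace only (so "I" is kept), it is case-sensitive (so "Learning" and "learning" are separate entries), and it records whether a word is present (1 or 0) rather than how often it occurs.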

In [28]:
sample_text = [
    "I am Aftab Mallick",
    "I am Interested in learning NLP",
    "I know Machine Learning"
]

# Create a vocabulary list from the sample_text
vocabulary = [word for sentence in sample_text for word in sentence.split()]

# Convert the list to a set to remove duplicates, then back to a list for indexing.
# Sets are unordered, so the word-to-index assignment can differ between runs.
vocabulary = list(set(vocabulary))

# Create a dictionary to map each word to an index
vocabulary_dict = {word: i for i, word in enumerate(vocabulary)}

# Initialize the sample_text_vector with zeros
sample_text_vector = [[0] * len(vocabulary) for _ in range(len(sample_text))]

# Mark each word that occurs in a sentence with 1 (binary presence, not a count)
for i, sentence in enumerate(sample_text):
    for word in sentence.split():
        if word in vocabulary_dict:  # Check if the word is in the dictionary
            sample_text_vector[i][vocabulary_dict[word]] = 1

print(f"Vocabulary: {vocabulary_dict}")
print(f"Document terms: \n{sample_text_vector}")



Vocabulary: {'I': 0, 'know': 1, 'NLP': 2, 'Learning': 3, 'am': 4, 'Aftab': 5, 'Interested': 6, 'in': 7, 'Mallick': 8, 'learning': 9, 'Machine': 10}
Document terms: 
[[1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0], [1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0], [1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1]]
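
The manual version above can be brought closer to CountVectorizer's unigram behaviour. The following is a rough sketch rather than a reimplementation: it assumes lowercasing, tokens of two or more word characters, a sorted vocabulary for reproducible indices, and true term counts. The names tokenize and count_vectors are hypothetical.

```python
import re
from collections import Counter

sample_text = [
    "I am Aftab Mallick",
    "I am Interested in learning NLP",
    "I know Machine Learning",
]


def tokenize(sentence):
    """Lowercase the sentence and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", sentence.lower())


tokenized = [tokenize(sentence) for sentence in sample_text]

# Sorting the unique tokens makes the index assignment deterministic
vocabulary = sorted({token for tokens in tokenized for token in tokens})
vocabulary_dict = {word: i for i, word in enumerate(vocabulary)}

# One row per document, one column per vocabulary word, holding term counts
count_vectors = []
for tokens in tokenized:
    counts = Counter(tokens)
    count_vectors.append([counts[word] for word in vocabulary])

print(f"Vocabulary: {vocabulary_dict}")
print(f"Document terms: \n{count_vectors}")
```

With lowercasing applied, "learning" and "Learning" share one column, so the vocabulary shrinks from 11 entries to 9. No word repeats within a sentence in this sample, so every count is 0 or 1; a repeated word would produce a larger value.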
