### Bag of Words (BoW)

**Bag of Words** (BoW) is a technique for extracting features from text for machine learning tasks such as text classification and sentiment analysis. It is needed because most machine learning algorithms operate on numeric input and cannot process raw text directly. Converting text into numbers is known as feature extraction or feature encoding.

A Bag of Words model represents a document by the occurrence of its words. The process starts by building a vocabulary of the unique words in the corpus and then counting how often each vocabulary word occurs in each document. It is called a "bag" because the order and structure of the words are discarded; only their counts are kept.
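
To make the "bag" idea concrete, here is a minimal sketch using Python's `collections.Counter`. The two sentences are made up for illustration and do not belong to the corpus used in this notebook. They contain the same words in a different order, so their bags should be identical.

```python
from collections import Counter

# Two hypothetical sentences with the same words in a different order
doc_a = "the dog chased the cat"
doc_b = "the cat chased the dog"

# A bag of words keeps only each word and its count
bag_a = Counter(doc_a.split())
bag_b = Counter(doc_b.split())

print(bag_a["the"])    # count of "the" in doc_a
print(bag_a == bag_b)  # word order is ignored, so the bags compare equal
```

This also shows the main limitation of the representation: the two sentences mean different things, yet they cannot be told apart from their bags.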

##### Manual implementation

In [1]:
## Step 1: Preprocessing the Text Data

## We'll start by defining a simple function to process text, including tokenization, lowercasing, and removing punctuation.

import string

# Sample text data: sentences
corpus = [
    "Python is amazing and fun.",
    "Python is not just fun but also powerful.",
    "Learning Python is fun!",
]
# Function to preprocess text
def preprocess(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Tokenize: split the text into words
    tokens = text.split()
    return tokens

# Apply preprocessing to the sample corpus
processed_corpus = [preprocess(sentence) for sentence in corpus]
print(processed_corpus)

[['python', 'is', 'amazing', 'and', 'fun'], ['python', 'is', 'not', 'just', 'fun', 'but', 'also', 'powerful'], ['learning', 'python', 'is', 'fun']]


In [3]:
## Step 2: Build Vocabulary

## We need to scan through all the documents and build a complete list of unique words; this list is our vocabulary.

# Initialize an empty set for the vocabulary
vocabulary = set()

# Build the vocabulary
for sentence in processed_corpus:
    vocabulary.update(sentence)

# Convert to a sorted list
vocabulary = sorted(vocabulary)
print("Vocabulary:", vocabulary)

Vocabulary: ['also', 'amazing', 'and', 'but', 'fun', 'is', 'just', 'learning', 'not', 'powerful', 'python']


In [4]:
## Step 3: Calculate Word Frequencies and Vectorize

## We'll now calculate the frequency of each word in the vocabulary for every document in the processed corpus.

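def create_bow_vector(sentence, vocab):
    # Map each vocabulary word to its column index for constant-time lookups
    word_to_index = {word: idx for idx, word in enumerate(vocab)}
    vector = [0] * len(vocab)  # Initialize a vector of zeros
    for word in sentence:
        idx = word_to_index.get(word)
        if idx is not None:  # Skip words that are not in the vocabulary
            vector[idx] += 1  # Increment the count at that index
    return vector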


In [7]:
## Apply the function to every document to obtain a Bag of Words representation of the corpus.

# Create BoW vector for each sentence in the processed corpus
bow_vectors = [create_bow_vector(sentence, vocabulary) for sentence in processed_corpus]
print("Bag of Words Vectors:")
print(vocabulary)
for vector in bow_vectors:
    print(vector)


Bag of Words Vectors:
['also', 'amazing', 'and', 'but', 'fun', 'is', 'just', 'learning', 'not', 'powerful', 'python']
[0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1]
[1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1]
[0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1]
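
As an alternative sketch, the counting loop can be replaced with `collections.Counter`, which counts every token in one pass and returns 0 for words it has not seen. The helper name `bow_vector_with_counter` is hypothetical; the vocabulary and tokens are copied from the outputs above so that the block runs on its own.

```python
from collections import Counter

# Vocabulary and tokenized first document, copied from the outputs above
vocabulary = ['also', 'amazing', 'and', 'but', 'fun', 'is', 'just',
              'learning', 'not', 'powerful', 'python']
tokens = ['python', 'is', 'amazing', 'and', 'fun']

def bow_vector_with_counter(tokens, vocab):
    # Hypothetical helper: count all tokens once, then read the counts in vocabulary order
    counts = Counter(tokens)
    return [counts[word] for word in vocab]  # a Counter returns 0 for missing words

vector = bow_vector_with_counter(tokens, vocabulary)
print(vector)
```

Tokens that are not in the vocabulary are counted by the `Counter` but never read back, so they are ignored in the final vector.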


##### Using Scikit-learn’s CountVectorizer

In [10]:
from sklearn.feature_extraction.text import CountVectorizer


# Original corpus
corpus = [
    "Python is amazing and fun.",
    "Python is not just fun but also powerful.",
    "Learning Python is fun!",
]

# Create a CountVectorizer Object
vectorizer = CountVectorizer()  # Optional: binary=True gives a binary BoW; max_features=10 keeps the 10 most frequent terms

# Fit and transform the corpus
X = vectorizer.fit_transform(corpus)

# Print the generated vocabulary
print("Vocabulary:", vectorizer.get_feature_names_out())

# Print the Bag-of-Words matrix
print("BoW Representation:")
print(X.toarray())

Vocabulary: ['also' 'amazing' 'and' 'but' 'fun' 'is' 'just' 'learning' 'not'
 'powerful' 'python']
BoW Representation:
[[0 1 1 0 1 1 0 0 0 0 1]
 [1 0 0 1 1 1 1 0 1 1 1]
 [0 0 0 0 1 1 0 1 0 0 1]]


In [11]:
from sklearn.feature_extraction.text import CountVectorizer
# Sample corpus of movie reviews
corpus = [
    "I loved the movie, it was fantastic!",
    "The movie was okay, but not great.",
    "I hated the movie, it was terrible.",
]

# Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the corpus to a document-term matrix
X = vectorizer.fit_transform(corpus)

# Convert the sparse document-term matrix into a dense array (optional, for visualization)
X_dense = X.toarray()

# Get the feature names (vocabulary terms, ordered by column index)
vocab = vectorizer.get_feature_names_out()

# Print the vocabulary and document-term matrix
print("Vocabulary:", vocab)
print("Document-Term Matrix:\n", X_dense)

Vocabulary: ['but' 'fantastic' 'great' 'hated' 'it' 'loved' 'movie' 'not' 'okay'
 'terrible' 'the' 'was']
Document-Term Matrix:
 [[0 1 0 0 1 1 1 0 0 0 1 1]
 [1 0 1 0 0 0 1 1 1 0 1 1]
 [0 0 0 1 1 0 1 0 0 1 1 1]]
