# Bag-of-Words Text Classification

In this tutorial we will show how to build a simple Bag of Words (BoW) text classifier using PyTorch. The classifier is trained on IMDB movie reviews dataset.

In [62]:
from pathlib import Path

import pandas as pd
import torch
import torch.nn.functional as F
import torch.nn as nn
import torch.optim as optim
# from google_drive_downloader import GoogleDriveDownloader as gdd
from torch.utils.data import DataLoader, Dataset
from sklearn.feature_extraction.text import CountVectorizer
from tqdm.notebook import tqdm
from datasets import load_dataset

In [63]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

In [64]:
DATA_PATH = 'sample_data/imdb_reviews.csv'
dataset = load_dataset("imdb")
df = pd.DataFrame(dataset['train'])
df.to_csv(DATA_PATH, index=False)
if not Path(DATA_PATH).is_file():
    gdd.download_file_from_google_drive(
        file_id='1zfM5E6HvKIe7f3rEt1V2gBpw5QOSSKQz',
        dest_path=DATA_PATH,
    )

In [65]:
df

Unnamed: 0,text,label
0,I rented I AM CURIOUS-YELLOW from my video sto...,0
1,"""I Am Curious: Yellow"" is a risible and preten...",0
2,If only to avoid making this type of film in t...,0
3,This film was probably inspired by Godard's Ma...,0
4,"Oh, brother...after hearing about this ridicul...",0
...,...,...
24995,A hit at the time but now better categorised a...,1
24996,I love this movie like no other. Another time ...,1
24997,This film and it's sequel Barry Mckenzie holds...,1
24998,'The Adventures Of Barry McKenzie' started lif...,1


In [66]:
# View some example records
pd.read_csv(DATA_PATH).sample(5)

Unnamed: 0,text,label
18586,My college professor says that Othello may be ...,1
9804,"Relentlessly stupid, no-budget ""war picture"" m...",0
12287,"I happened upon this flick on a rainy Sunday, ...",0
7064,This movie has some of the most awesome cars I...,0
3391,What's with Indonesian musical movies? Never h...,0


## Bag-of-Words Representation

![](images/bow_diagram.png)

So the final bag-of-words vector for `['the', 'gray', 'cat', 'sat', 'on', 'the', 'gray', 'mat']` is `[0, 1, 1, 2, 2, 1, 0, 1]`.

In [67]:
# Import necessary libraries
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from torch.utils.data import Dataset

class Sequences(Dataset):
    def __init__(self, path):
        """
        Initializes the Dataset object.

        Args:
            path (str): The file path to the CSV data.
        """
        # TODO: Read the csv file from the given path into a pandas DataFrame.
        # Hint: Use pandas' read_csv function.
        df = pd.read_csv(path)

        # TODO: Initialize a CountVectorizer.
        # This will be used to convert text into a matrix of token counts.
        # Set the following parameters:
        # stop_words='english' -> to remove common English words.
        # max_df=0.99 -> to ignore terms that appear in more than 99% of documents.
        # min_df=0.005 -> to ignore terms that appear in less than 0.5% of documents.
        self.vectorizer = CountVectorizer(stop_words="english", max_df=0.99, min_df=0.005)

        # TODO: Fit the vectorizer to the 'review' column of your DataFrame and
        # transform the text into a document-term matrix.
        # This matrix represents the token counts for each review.
        # Hint: Use the fit_transform() method of your vectorizer.
        self.sequences = self.vectorizer.fit_transform(df["text"])

        # TODO: Get the 'label' column from the DataFrame and convert it to a list.
        # These are the labels for each corresponding sequence.
        self.labels = df["label"]

        # TODO: Get the vocabulary from the vectorizer.
        # This is a dictionary that maps each token (word) to its unique index.
        # Hint: The vocabulary is stored in the 'vocabulary_' attribute of the fitted vectorizer.
        self.token2idx = {word : index for word, index in self.vectorizer.vocabulary_.items()}

        # TODO: Create a reverse mapping from index to token.
        # This will be useful for interpreting the model's output later.
        # Hint: You can create this by swapping the keys and values of the token2idx dictionary.
        # A dictionary comprehension like {idx: token for token, idx in self.token2idx.items()} is perfect for this.
        self.idx2token = {index : word for word, index in self.vectorizer.vocabulary_.items()}

    def __getitem__(self, i):
        """
        Gets the i-th sample from the dataset.

        Args:
            i (int): The index of the sample to retrieve.

        Returns:
            tuple: A tuple containing the i-th sequence and its corresponding label.
        """
        # TODO: Return the i-th sequence and its corresponding label.
        # Note: The sequences are stored in a sparse matrix. You need to convert
        # the i-th sequence to a dense numpy array before returning.
        # Hint: Use .toarray() on the selected sequence slice.
        # The label can be directly accessed from the self.labels list.
        return self.sequences[i].toarray(), self.labels[i]

    def __len__(self):
        """
        Returns the total number of samples in the dataset.

        Returns:
            int: The total number of sequences.
        """
        # TODO: Return the total number of sequences in the dataset.
        # Hint: This is the number of rows in your self.sequences matrix.
        # You can get this from its .shape attribute.
        return self.sequences.shape[0]

In [68]:
dataset = Sequences(DATA_PATH)
train_loader = DataLoader(dataset, batch_size=4096)

print(dataset[5][0].shape)

(1, 2943)


## Model Definition

![](images/bow_training_diagram.png)

Layer 1 affine: $$x_1 = W_1 X + b_1$$
Layer 1 activation: $$h_1 = \textrm{Relu}(x_1)$$
Layer 2 affine: $$x_2 = W_2 h_1 + b_2$$
output: $$p = \sigma(x_2)$$
Loss: $$L = −(ylog(p)+(1−y)log(1−p))$$
Gradient:
$$\frac{\partial }{\partial W_1}L(W_1, b_1, W_2, b_2) = \frac{\partial L}{\partial p}\frac{\partial p}{\partial x_2}\frac{\partial x_2}{\partial h_1}\frac{\partial h_1}{\partial x_1}\frac{\partial x_1}{\partial W_1}$$

Parameter update:
$$W_1 = W_1 - \alpha \frac{\partial L}{\partial W_1}$$

In [69]:
# Import necessary libraries
import torch.nn as nn
import torch.nn.functional as F

class BagOfWordsClassifier(nn.Module):
    def __init__(self, vocab_size, hidden1, hidden2):
        """
        Initializes the BagOfWordsClassifier model.

        This model consists of three linear layers for a simple feed-forward network.

        Args:
            vocab_size (int): The number of unique tokens in the vocabulary.
                              This will be the input size for the first layer.
            hidden1 (int): The number of neurons in the first hidden layer.
            hidden2 (int): The number of neurons in the second hidden layer.
        """
        super(BagOfWordsClassifier, self).__init__()

        # TODO: Define the first fully connected (linear) layer. 🧠
        # It should take 'vocab_size' as input and have 'hidden1' as output.
        # Hint: Use nn.Linear()
        self.fc1 = nn.Linear(vocab_size, hidden1)

        # TODO: Define the second fully connected (linear) layer.
        # It should take 'hidden1' as input (from the previous layer)
        # and have 'hidden2' as output.
        self.fc2 = nn.Linear(hidden1, hidden2)

        # TODO: Define the output layer.
        # It should take 'hidden2' as input and have 1 as output.
        # The output size is 1 because we are performing binary classification
        # (e.g., positive vs. negative sentiment).
        self.fc3 = nn.Linear(hidden2, 1)

    def forward(self, inputs):
        """
        Defines the forward pass of the model.

        This method describes how the input data flows through the network layers.

        Args:
            inputs (torch.Tensor): The input tensor containing the bag-of-words vectors.
                                   Its shape will be (batch_size, 1, vocab_size).

        Returns:
            torch.Tensor: The output logits from the model.
        """
        # TODO: Pass the inputs through the first linear layer (self.fc1)
        # and apply a ReLU activation function.
        # Important: The input tensor has a shape of (batch_size, 1, vocab_size).
        # You must first remove the dimension of size 1 using .squeeze(1) and ensure
        # the data type is float using .float() before passing it to the linear layer.
        # The sequence should be: squeeze -> float -> fc1 -> relu
        # Hint: Use F.relu() for the activation.
        x = F.relu(self.fc1(inputs.squeeze(1).float()))

        # TODO: Pass the result from the previous step ('x') through the second
        # linear layer (self.fc2) and apply another ReLU activation function.
        x = F.relu(self.fc2(x))

        # TODO: Pass the result through the final output layer (self.fc3) and return it.
        # Note: We don't apply a sigmoid activation here. This is because we will likely
        # use a loss function like BCEWithLogitsLoss, which is more numerically stable
        # and applies the sigmoid function internally.
        return self.fc3(x)

In [70]:
model = BagOfWordsClassifier(len(dataset.token2idx), 128, 64)
model

BagOfWordsClassifier(
  (fc1): Linear(in_features=2943, out_features=128, bias=True)
  (fc2): Linear(in_features=128, out_features=64, bias=True)
  (fc3): Linear(in_features=64, out_features=1, bias=True)
)

In [71]:
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam([p for p in model.parameters() if p.requires_grad], lr=0.001)

model.train()
train_losses = []
for epoch in range(10):
    progress_bar = tqdm(train_loader, leave=False)
    losses = []
    total = 0
    for inputs, target in progress_bar:
        model.zero_grad()

        output = model(inputs)
        loss = criterion(output.squeeze(), target.float())

        loss.backward()

        nn.utils.clip_grad_norm_(model.parameters(), 3)

        optimizer.step()

        progress_bar.set_description(f'Loss: {loss.item():.3f}')

        losses.append(loss.item())
        total += 1

    epoch_loss = sum(losses) / total
    train_losses.append(epoch_loss)

    tqdm.write(f'Epoch #{epoch + 1}\tTrain Loss: {epoch_loss:.3f}')

  0%|          | 0/7 [00:00<?, ?it/s]

Epoch #1	Train Loss: 0.744


  0%|          | 0/7 [00:00<?, ?it/s]

Epoch #2	Train Loss: 0.700


  0%|          | 0/7 [00:00<?, ?it/s]

Epoch #3	Train Loss: 0.698


  0%|          | 0/7 [00:00<?, ?it/s]

Epoch #4	Train Loss: 0.696


  0%|          | 0/7 [00:00<?, ?it/s]

Epoch #5	Train Loss: 0.693


  0%|          | 0/7 [00:00<?, ?it/s]

Epoch #6	Train Loss: 0.688


  0%|          | 0/7 [00:00<?, ?it/s]

Epoch #7	Train Loss: 0.679


  0%|          | 0/7 [00:00<?, ?it/s]

Epoch #8	Train Loss: 0.665


  0%|          | 0/7 [00:00<?, ?it/s]

Epoch #9	Train Loss: 0.644


  0%|          | 0/7 [00:00<?, ?it/s]

Epoch #10	Train Loss: 0.619


In [72]:
def predict_sentiment(text):
    model.eval()
    with torch.no_grad():
        test_vector = torch.LongTensor(dataset.vectorizer.transform([text]).toarray())

        output = model(test_vector)
        prediction = torch.sigmoid(output).item()

        if prediction > 0.5:
            print(f'{prediction:0.3}: Positive sentiment')
        else:
            print(f'{prediction:0.3}: Negative sentiment')

## Analyzing reviews for "Cool Cat Saves the Kids"

![](https://m.media-amazon.com/images/M/MV5BNzE1OTY3OTk5M15BMl5BanBnXkFtZTgwODE0Mjc1NDE@._V1_UY268_CR11,0,182,268_AL_.jpg)

In [73]:
test_text = """
This poor excuse for a movie is terrible. It has been 'so good it's bad' for a
while, and the high ratings are a good form of sarcasm, I have to admit. But
now it has to stop. Technically inept, spoon-feeding mundane messages with the
artistic weight of an eighties' commercial, hypocritical to say the least, it
deserves to fall into oblivion. Mr. Derek, I hope you realize you are like that
weird friend that everybody know is lame, but out of kindness and Christian
duty is treated like he's cool or something. That works if you are a good
decent human being, not if you are a horrible arrogant bully like you are. Yes,
Mr. 'Daddy' Derek will end on the history books of the internet for being a
delusional sour old man who thinks to be a good example for kids, but actually
has a poster of Kim Jong-Un in his closet. Destroy this movie if you all have a
conscience, as I hope IHE and all other youtube channel force-closed by Derek
out of SPITE would destroy him in the courts.This poor excuse for a movie is
terrible. It has been 'so good it's bad' for a while, and the high ratings are
a good form of sarcasm, I have to admit. But now it has to stop. Technically
inept, spoon-feeding mundane messages with the artistic weight of an eighties'
commercial, hypocritical to say the least, it deserves to fall into oblivion.
Mr. Derek, I hope you realize you are like that weird friend that everybody
know is lame, but out of kindness and Christian duty is treated like he's cool
or something. That works if you are a good decent human being, not if you are a
horrible arrogant bully like you are. Yes, Mr. 'Daddy' Derek will end on the
history books of the internet for being a delusional sour old man who thinks to
be a good example for kids, but actually has a poster of Kim Jong-Un in his
closet. Destroy this movie if you all have a conscience, as I hope IHE and all
other youtube channel force-closed by Derek out of SPITE would destroy him in
the courts.
"""
predict_sentiment(test_text)

0.298: Negative sentiment


In [74]:
test_text = """
Cool Cat Saves The Kids is a symbolic masterpiece directed by Derek Savage that
is not only satirical in the way it makes fun of the media and politics, but in
the way in questions as how we humans live life and how society tells us to
live life.

Before I get into those details, I wanna talk about the special effects in this
film. They are ASTONISHING, and it shocks me that Cool Cat Saves The Kids got
snubbed by the Oscars for Best Special Effects. This film makes 2001 look like
garbage, and the directing in this film makes Stanley Kubrick look like the
worst director ever. You know what other film did that? Birdemic: Shock and
Terror. Both of these films are masterpieces, but if I had to choose my
favorite out of the 2, I would have to go with Cool Cat Saves The Kids. It is
now my 10th favorite film of all time.

Now, lets get into the symbolism: So you might be asking yourself, Why is Cool
Cat Orange? Well, I can easily explain. Orange is a color. Orange is also a
fruit, and its a very good fruit. You know what else is good? Good behavior.
What behavior does Cool Cat have? He has good behavior. This cannot be a
coincidence, since cool cat has good behavior in the film.

Now, why is Butch The Bully fat? Well, fat means your wide. You wanna know who
was wide? Hitler. Nuff said this cannot be a coincidence.

Why does Erik Estrada suspect Butch The Bully to be a bully? Well look at it
this way. What color of a shirt was Butchy wearing when he walks into the area?
I don't know, its looks like dark purple/dark blue. Why rhymes with dark? Mark.
Mark is that guy from the Room. The Room is the best movie of all time. What is
the opposite of best? Worst. This is how Erik knew Butch was a bully.

and finally, how come Vivica A. Fox isn't having a successful career after
making Kill Bill.

I actually can't answer that question.

Well thanks for reading my review.
"""
predict_sentiment(test_text)

0.492: Negative sentiment


In [75]:
test_text = """
Don't let any bullies out there try and shape your judgment on this gem of a
title.

Some people really don't have anything better to do, except trash a great movie
with annoying 1-star votes and spread lies on the Internet about how "dumb"
Cool Cat is.

I wouldn't be surprised to learn if much of the unwarranted negativity hurled
at this movie is coming from people who haven't even watched this movie for
themselves in the first place. Those people are no worse than the Butch the
Bully, the film's repulsive antagonist.

As it just so happens, one of the main points of "Cool Cat Saves the Kids" is
in addressing the attitudes of mean naysayers who try to demean others who
strive to bring good attitudes and fun vibes into people's lives. The message
to be learned here is that if one is friendly and good to others, the world is
friendly and good to one in return, and that is cool. Conversely, if one is
miserable and leaving 1-star votes on IMDb, one is alone and doesn't have any
friends at all. Ain't that the truth?

The world has uncovered a great, new, young filmmaking talent in "Cool Cat"
creator Derek Savage, and I sure hope that this is only the first of many
amazing films and stories that the world has yet to appreciate.

If you are a cool person who likes to have lots of fun, I guarantee that this
is a movie with charm that will uplift your spirits and reaffirm your positive
attitudes towards life.
"""
predict_sentiment(test_text)

0.523: Positive sentiment


In [76]:
test_text = """
What the heck is this ? There is not one redeeming quality about this terrible
and very poorly done "movie". I can't even say that it's a "so bad it's good
movie".It is undeniably pointless to address all the things wrong here but
unfortunately even the "life lessons" about bullies and stuff like this are so
wrong and terrible that no kid should hear them.The costume is also horrible
and the acting...just unbelievable.No effort whatsoever was put into this thing
and it clearly shows,I have no idea what were they thinking or who was it even
meant for. I feel violated after watching this trash and I deeply recommend you
stay as far away as possible.This is certainly one of the worst pieces of c***
I have ever seen.
"""
predict_sentiment(test_text)

0.326: Negative sentiment
