<a href="https://colab.research.google.com/github/Sambarlasagna/movie-sentiment-analysis/blob/main/CNN_MSA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building a deep learning model using CNN to analyze movie reviews


In [20]:
import collections

import datasets
import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import tqdm

try:
  import torchtext
except:
  !pip install torchtext==0.17.2
  import torchtext

### Getting the dataset from HuggingFace using the datasets library
Split the data into `train_data` and `test_data`



In [169]:
train_data, test_data = datasets.load_dataset("imdb", split=["train", "test"])

In [170]:
train_data,test_data

(Dataset({
     features: ['text', 'label'],
     num_rows: 25000
 }),
 Dataset({
     features: ['text', 'label'],
     num_rows: 25000
 }))

In [171]:
train_data[0],test_data[0]

({'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far b

In [172]:
train_data.features

{'text': Value('string'), 'label': ClassLabel(names=['neg', 'pos'])}

## Tokenization

Machine learning models cannot work on strings, hence we are gonna split the strings and assign them unique values so that the model can work on these numerical values


In [173]:
tokenizer = torchtext.data.utils.get_tokenizer("basic_english")

In [174]:
tokenizer("Hello guys!We will be building a ml model today")

['hello',
 'guys',
 '!',
 'we',
 'will',
 'be',
 'building',
 'a',
 'ml',
 'model',
 'today']

Adding a new column with tokens for each text in a row

also limitting the tokens to a `max_length` of few hundereds since sentiment can be predicted pretty well with just firts couple hundered tokens eliminating long and unnecessary ones


In [175]:
#Creating a function which takes in a dataset, and returns tokens in dict form

def tokenize_example(example,tokenizer,max_length):
  tokens = tokenizer(example["text"])[:max_length]
  return {"tokens":tokens}

Using the `map` method in `Dataset` class provided by the `dataset` library to update our `train_data` and `test_data`


In [176]:
# any arguemnts to the functions that arent example must be passed thru fn_kwargs dictioanry
max_length = 256

train_data = train_data.map(
    tokenize_example, fn_kwargs={"tokenizer":tokenizer,"max_length" : max_length}
)

test_data = test_data.map(
    tokenize_example, fn_kwargs={"tokenizer":tokenizer,"max_length" : max_length}
)

In [177]:
train_data,train_data.features

(Dataset({
     features: ['text', 'label', 'tokens'],
     num_rows: 25000
 }),
 {'text': Value('string'),
  'label': ClassLabel(names=['neg', 'pos']),
  'tokens': List(Value('string'))})

In [178]:
print(len(train_data))

25000


In [179]:
train_data[0]['tokens'][:10]

['i',
 'rented',
 'i',
 'am',
 'curious-yellow',
 'from',
 'my',
 'video',
 'store',
 'because']

### Creating Validation data
Every time we tune our model hyperparameters or training set-up to make it do a bit better on the test set, we are leak information from the test set into the training process. If we do this too often then we begin to overfit on the test set. Hence, we need some data which can act as a "proxy" test set which we can look at more frequently in order to evaluate how well our model actually does on unseen data -- this is the validation set.

In [180]:
test_size = 0.25
train_valid_data = train_data.train_test_split(test_size = test_size)

# Explicitly check if the split data is not None before reassigning
if train_valid_data is not None:
    train_data = train_valid_data['train']
    valid_data = train_valid_data['test']
else:
    print("train_test_split returned None") # Add a print statement for debugging

In [181]:
len(train_data),len(valid_data),len(test_data)

(18750, 6250, 25000)

### Creating vocabulary
assigning indexes for to tokens since models cant operate on strings. Also creating minimum freq , tokens above this min freq will be taken and others will marked as `<unk>` and `<pad` we will be using for padding sentences

In [182]:
min_freq = 5
special_tokens = ["<unk>","<pad>"]

vocab = torchtext.vocab.build_vocab_from_iterator(
    train_data["tokens"],
    min_freq - min_freq,
    specials = special_tokens
)

In [183]:
len(vocab)

72773

In [184]:
# to view the tokens
vocab.get_itos()[:10]

['<unk>', '<pad>', 'the', '.', ',', 'a', 'and', 'of', 'to', "'"]

In [185]:
vocab['and']

6

In [186]:
unk_index = vocab["<unk>"]
pad_index = vocab["<pad>"]

In [187]:
"random_token" in vocab

False

In [188]:
# vocab["random_token"] this will throw in error if the token is not present
# hence make it return <unk>
vocab.set_default_index(unk_index)

In [189]:
vocab["random_token"]

0

In [190]:
vocab.lookup_indices(["hello","world","some_token","<pad>"])


[5005, 189, 0, 1]

### Numericalizing Data
to convert our tokens to indices fromt he datsets

In [191]:
def numericalize_example(example,vocab):
  ids = vocab.lookup_indices(example["tokens"])
  return {"ids":ids}

In [192]:
train_data = train_data.map(numericalize_example,fn_kwargs = {"vocab":vocab}, load_from_cache_file=False)
valid_data = valid_data.map(numericalize_example,fn_kwargs = {"vocab":vocab}, load_from_cache_file=False)
test_data = test_data.map(numericalize_example,fn_kwargs = {"vocab":vocab}, load_from_cache_file=False)

Map:   0%|          | 0/18750 [00:00<?, ? examples/s]

Map:   0%|          | 0/6250 [00:00<?, ? examples/s]

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

| Access           | `ids` shape    | `label` shape |
| ---------------- | -------------- | ------------- |
| `train_data[0]`  | `(seq_len,)`   | scalar        |
| `train_data[:1]` | `(1, seq_len)` | `(1,)`        |
| `train_data[:n]` | `(n, seq_len)` | `(n,)`        |


In [193]:
train_data[0]["tokens"][:10]

['anyone',
 'who',
 'has',
 'watched',
 'comedy',
 'central',
 'around',
 'midnight',
 'in',
 'the']

In [194]:
vocab.lookup_indices(train_data[0]["tokens"][:10])

[256, 42, 50, 251, 201, 1372, 197, 2968, 13, 2]

In [195]:
train_data[0]["ids"][:10]

[256, 42, 50, 251, 201, 1372, 197, 2968, 13, 2]

In [204]:
#Convert ids and label from integers to pytorch tensors
train_data = train_data.with_format(type = "torch", columns = ["ids","label"])
valid_data = valid_data.with_format(type = "torch", columns = ["ids","label"])
test_data = test_data.with_format(type = "torch", columns = ["ids","label"])

In [217]:
sample = train_data[:4]
print(sample['label'])

tensor([0, 0, 0, 0])


In [221]:
train_data[:1].keys()

dict_keys(['label', 'ids'])

Removing the "tokens" field is fine, as if we wanted to retrieve the human-readable tokens again we can simply convert the tensor into a Python list of integers and then use the vocabulary's lookup_tokens method.

In [226]:

vocab.lookup_tokens(train_data[:1]["ids"][0][:10].tolist())

['anyone',
 'who',
 'has',
 'watched',
 'comedy',
 'central',
 'around',
 'midnight',
 'in',
 'the']

### Creating DataLoaders

The `collate_fn` takes a list of individual data samples, pads the sequences of token IDs to a uniform length within the batch, stacks the labels, and returns a single dictionary representing a batch ready for use in a PyTorch DataLoader.

In [227]:
def get_collate_fn(pad_index):
    def collate_fn(batch):
        batch_ids = [i["ids"] for i in batch]
        batch_ids = nn.utils.rnn.pad_sequence(
            batch_ids, padding_value=pad_index, batch_first=True
        )
        batch_label = [i["label"] for i in batch]
        batch_label = torch.stack(batch_label)
        batch = {"ids": batch_ids, "label": batch_label}
        return batch

    return collate_fn

`batch_ids = nn.utils.rnn.pad_sequence(batch_ids, padding_value=pad_index, batch_first=True)`: This is a crucial step. Since the reviews have different lengths, the tensors in `batch_ids` also have different lengths. `nn.utils.rnn.pad_sequence` is a PyTorch function that pads these tensors to the same length (the length of the longest tensor in the batch) by adding the `pad_index` value.

In [228]:

def get_data_loader(dataset, batch_size, pad_index, shuffle=False):
    collate_fn = get_collate_fn(pad_index)
    data_loader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=batch_size,
        collate_fn=collate_fn,
        shuffle=shuffle,
    )
    return data_loader

In [229]:
batch_size = 512

train_data_loader = get_data_loader(train_data, batch_size, pad_index, shuffle=True)
valid_data_loader = get_data_loader(valid_data, batch_size, pad_index)
test_data_loader = get_data_loader(test_data, batch_size, pad_index)

In [230]:
class CNN(nn.Module):
    def __init__(
        self,
        vocab_size,
        embedding_dim,
        n_filters,
        filter_sizes,
        output_dim,
        dropout_rate,
        pad_index,
    ):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_index) #Creates an embedding layer
        #An embedding layer converts the numericalized tokens (word IDs) into dense vectors of fixed size (embedding_dim).
        self.convs = nn.ModuleList( #Creates a list of conv layers
            [
                nn.Conv1d(embedding_dim, n_filters, filter_size)
                for filter_size in filter_sizes
            ]
        )
        self.fc = nn.Linear(len(filter_sizes) * n_filters, output_dim)
        #This creates a fully connected (linear) layer.
        #It takes the concatenated output from all the convolutional filters and maps it to the final output dimension (output_dim),
        #which would typically be the number of classes (e.g., 2 for positive/negative sentiment).

        self.dropout = nn.Dropout(dropout_rate)
        # regularization technique to prevent overfitting by randomly setting a fraction of input units to 0 at each update during training.
    def forward(self, ids):
        # ids = [batch size, seq len]
        embedded = self.dropout(self.embedding(ids))
        # embedded = [batch size, seq len, embedding dim]
        embedded = embedded.permute(0, 2, 1)
        # embedded = [batch size, embedding dim, seq len]
        conved = [torch.relu(conv(embedded)) for conv in self.convs]
        # conved_n = [batch size, n filters, seq len - filter_sizes[n] + 1]
        pooled = [conv.max(dim=-1).values for conv in conved]
        # pooled_n = [batch size, n filters]
        cat = self.dropout(torch.cat(pooled, dim=-1))
        # cat = [batch size, n filters * len(filter_sizes)]
        prediction = self.fc(cat)
        # prediction = [batch size, output dim]
        return prediction

In summary, this CNN model for text classification works by:

Converting token IDs into dense embeddings.

Applying multiple convolutional filters of different sizes to capture local patterns (n-grams) in the embedded sequences.

Using max pooling to extract the most important features from each filter's output.

Concatenating the pooled features.
Passing the concatenated features through a linear layer to produce the final sentiment prediction.

Using dropout for regularization.

In [231]:

vocab_size = len(vocab)
embedding_dim = 300
n_filters = 100
filter_sizes = [3, 5, 7]
output_dim = len(train_data.unique("label"))
dropout_rate = 0.25

model = CNN(
    vocab_size,
    embedding_dim,
    n_filters,
    filter_sizes,
    output_dim,
    dropout_rate,
    pad_index,
)

In [232]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


print(f"The model has {count_parameters(model):,} trainable parameters")

The model has 22,282,802 trainable parameters


In [233]:

def initialize_weights(m):
    if isinstance(m, nn.Linear):
        nn.init.xavier_normal_(m.weight)
        nn.init.zeros_(m.bias)
    elif isinstance(m, nn.Conv1d):
        nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
        nn.init.zeros_(m.bias)

In [234]:

model.apply(initialize_weights)

CNN(
  (embedding): Embedding(72773, 300, padding_idx=1)
  (convs): ModuleList(
    (0): Conv1d(300, 100, kernel_size=(3,), stride=(1,))
    (1): Conv1d(300, 100, kernel_size=(5,), stride=(1,))
    (2): Conv1d(300, 100, kernel_size=(7,), stride=(1,))
  )
  (fc): Linear(in_features=300, out_features=2, bias=True)
  (dropout): Dropout(p=0.25, inplace=False)
)

In [None]:
vectors = torchtext.vocab.GloVe()

.vector_cache/glove.840B.300d.zip:  15%|█▌        | 336M/2.18G [03:40<15:37, 1.96MB/s]

In [None]:
pretrained_embedding = vectors.get_vecs_by_tokens(vocab.get_itos())

In [None]:

model.embedding.weight.data = pretrained_embedding

In [None]:

optimizer = optim.Adam(model.parameters())

In [None]:

criterion = nn.CrossEntropyLoss()

In [None]:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

device

In [None]:

model = model.to(device)
criterion = criterion.to(device)

In [None]:

def train(data_loader, model, criterion, optimizer, device):
    model.train()
    epoch_losses = []
    epoch_accs = []
    for batch in tqdm.tqdm(data_loader, desc="training..."):
        ids = batch["ids"].to(device)
        label = batch["label"].to(device)
        prediction = model(ids)
        loss = criterion(prediction, label)
        accuracy = get_accuracy(prediction, label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        epoch_losses.append(loss.item())
        epoch_accs.append(accuracy.item())
    return np.mean(epoch_losses), np.mean(epoch_accs)

In [None]:
def evaluate(data_loader, model, criterion, device):
    model.eval()
    epoch_losses = []
    epoch_accs = []
    with torch.no_grad():
        for batch in tqdm.tqdm(data_loader, desc="evaluating..."):
            ids = batch["ids"].to(device)
            label = batch["label"].to(device)
            prediction = model(ids)
            loss = criterion(prediction, label)
            accuracy = get_accuracy(prediction, label)
            epoch_losses.append(loss.item())
            epoch_accs.append(accuracy.item())
    return np.mean(epoch_losses), np.mean(epoch_accs)

In [None]:

def get_accuracy(prediction, label):
    batch_size, _ = prediction.shape
    predicted_classes = prediction.argmax(dim=-1)
    correct_predictions = predicted_classes.eq(label).sum()
    accuracy = correct_predictions / batch_size
    return accuracy

In [None]:

def get_accuracy(prediction, label):
    batch_size, _ = prediction.shape
    predicted_classes = prediction.argmax(dim=-1)
    correct_predictions = predicted_classes.eq(label).sum()
    accuracy = correct_predictions / batch_size
    return accuracy

In [None]:

fig = plt.figure(figsize=(10, 6))
ax = fig.add_subplot(1, 1, 1)
ax.plot(metrics["train_losses"], label="train loss")
ax.plot(metrics["valid_losses"], label="valid loss")
ax.set_xlabel("epoch")
ax.set_ylabel("loss")
ax.set_xticks(range(n_epochs))
ax.legend()
ax.grid()