<a href="https://colab.research.google.com/github/Sambarlasagna/movie-sentiment-analysis/blob/main/CNN_MSA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building a deep learning model using a CNN to analyze movie reviews


In [1]:
import collections

import datasets
import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import tqdm

try:
  import torchtext
except ImportError:
  # Install the pinned version if torchtext is missing, then retry the import
  !pip install torchtext==0.17.2
  import torchtext



### Getting the dataset from Hugging Face using the `datasets` library
Load the predefined train and test splits into `train_data` and `test_data`.



In [23]:
train_data, test_data = datasets.load_dataset("imdb", split=["train", "test"])

In [24]:
train_data,test_data

(Dataset({
     features: ['text', 'label'],
     num_rows: 25000
 }),
 Dataset({
     features: ['text', 'label'],
     num_rows: 25000
 }))

In [25]:
train_data[0],test_data[0]

({'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far b

In [26]:
train_data.features

{'text': Value('string'), 'label': ClassLabel(names=['neg', 'pos'])}

## Tokenization

Machine learning models cannot work on raw strings, so we split each string into tokens and assign each token a unique numerical value that the model can work with.
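To make the idea concrete, here is a minimal sketch of the string-to-numbers step in plain Python. The toy corpus, the whitespace split, and the `token_to_id` name are assumptions made for illustration; the notebook itself uses the torchtext tokenizer on IMDb reviews.

```python
import collections

# Hypothetical toy corpus, used only for illustration.
texts = ["the movie was great", "the movie was boring"]

# Step 1: split each string into tokens (plain whitespace split as a stand-in).
tokenized = [text.split() for text in texts]

# Step 2: give every distinct token a unique integer id, most frequent first.
counts = collections.Counter(token for tokens in tokenized for token in tokens)
token_to_id = {token: idx for idx, (token, _) in enumerate(counts.most_common())}

# Step 3: replace tokens with their ids so a model can consume them.
ids = [[token_to_id[token] for token in tokens] for tokens in tokenized]

print(token_to_id)
print(ids)
```

The two reviews share ids for their common words and differ only in the final token, which is the kind of numerical signal a model can learn from.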


In [27]:
tokenizer = torchtext.data.utils.get_tokenizer("basic_english")

In [28]:
tokenizer("Hello guys!We will be building a ml model today")

['hello',
 'guys',
 '!',
 'we',
 'will',
 'be',
 'building',
 'a',
 'ml',
 'model',
 'today']

Add a new `tokens` column holding the tokens for the text in each row.

We also limit the tokens to a `max_length` of a few hundred, since sentiment can usually be predicted well from just the first couple of hundred tokens, which lets us drop the long tail of each review.


In [29]:
# Takes a single example (one row of the dataset) and returns its tokens,
# truncated to max_length, in dict form so that map can add them as a column

def tokenize_example(example, tokenizer, max_length):
  tokens = tokenizer(example["text"])[:max_length]
  return {"tokens": tokens}
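As a quick check of the truncation behaviour, the sketch below calls the same function on two made-up examples. It uses `str.split` as a stand-in tokenizer, which is an assumption made to keep the example free of torchtext; the real tokenizer also lowercases and separates punctuation.

```python
def tokenize_example(example, tokenizer, max_length):
    # Same logic as the notebook cell: tokenize, then keep the first max_length tokens.
    tokens = tokenizer(example["text"])[:max_length]
    return {"tokens": tokens}

# Hypothetical examples: one longer than max_length, one shorter.
long_review = {"text": "one two three four five"}
short_review = {"text": "one two"}

truncated = tokenize_example(long_review, str.split, max_length=3)
untouched = tokenize_example(short_review, str.split, max_length=3)

print(truncated)  # {'tokens': ['one', 'two', 'three']}
print(untouched)  # {'tokens': ['one', 'two']}
```

Slicing never raises on short inputs, so reviews with fewer than `max_length` tokens pass through unchanged.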

We use the `map` method of the `Dataset` class, provided by the `datasets` library, to update our `train_data` and `test_data`.


In [30]:
# Any arguments to the function other than the example must be passed through the fn_kwargs dictionary
max_length = 256

train_data = train_data.map(
    tokenize_example, fn_kwargs={"tokenizer":tokenizer,"max_length" : max_length}
)

test_data = test_data.map(
    tokenize_example, fn_kwargs={"tokenizer":tokenizer,"max_length" : max_length}
)
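Conceptually, `map` calls the function once per row with the extra keyword arguments and merges the returned dict into that row. The sketch below mimics this on a plain list of dicts; `simple_map` is a hypothetical helper written for illustration, not the library's implementation, and `str.split` again stands in for the real tokenizer.

```python
def simple_map(rows, function, fn_kwargs=None):
    """Conceptual stand-in for Dataset.map: call function(row, **fn_kwargs)
    for each row and merge the returned dict into the row as columns."""
    fn_kwargs = fn_kwargs or {}
    return [{**row, **function(row, **fn_kwargs)} for row in rows]

def tokenize_example(example, tokenizer, max_length):
    return {"tokens": tokenizer(example["text"])[:max_length]}

# Hypothetical rows shaped like the IMDb examples (text plus integer label).
rows = [
    {"text": "a fine film", "label": 1},
    {"text": "dull and far too long", "label": 0},
]

mapped = simple_map(
    rows, tokenize_example, fn_kwargs={"tokenizer": str.split, "max_length": 3}
)

for row in mapped:
    print(row)
```

Each row keeps its original `text` and `label` and gains a `tokens` entry, which mirrors the extra `tokens` feature that appears in the dataset after mapping.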

In [31]:
train_data,train_data.features

(Dataset({
     features: ['text', 'label', 'tokens'],
     num_rows: 25000
 }),
 {'text': Value('string'),
  'label': ClassLabel(names=['neg', 'pos']),
  'tokens': List(Value('string'))})

In [32]:
print(len(train_data))

25000


In [33]:
train_data[0]['tokens'][:10]

['i',
 'rented',
 'i',
 'am',
 'curious-yellow',
 'from',
 'my',
 'video',
 'store',
 'because']

### Creating Validation data
Every time we tune our model hyperparameters or training set-up to make it do a bit better on the test set, we leak information from the test set into the training process. If we do this too often, we begin to overfit on the test set. Hence, we need some data that can act as a "proxy" test set, which we can look at more frequently in order to evaluate how well our model actually does on unseen data: this is the validation set.

In [34]:
test_size = 0.25
train_valid_data = train_data.train_test_split(test_size = test_size)
train_data = train_valid_data['train']
valid_data = train_valid_data['test']

In [36]:
len(train_data),len(valid_data),len(test_data)

(18750, 6250, 25000)
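The sizes above follow directly from `test_size = 0.25` applied to the 25,000 training rows. As a rough sketch of what a shuffled split does, the plain-Python version below reproduces the counts; `split_indices` and the seed value are assumptions made for illustration, not the library's internals.

```python
import random

def split_indices(n_rows, test_size, seed):
    """Shuffle row indices with a fixed seed and carve off a held-out fraction."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_valid = int(n_rows * test_size)
    # The first n_valid shuffled indices form the validation set, the rest stay in train.
    return indices[n_valid:], indices[:n_valid]

train_idx, valid_idx = split_indices(25000, test_size=0.25, seed=0)

print(len(train_idx), len(valid_idx))  # 18750 6250
```

Because the shuffle is random, rerunning the notebook cell produces a different split each time. If a reproducible split is wanted, `train_test_split` also accepts a `seed` argument.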