# Building a deep learning model using a CNN to analyze movie reviews


In [13]:
import collections

import datasets
import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import tqdm

# Install torchtext only if it is not already available in this environment
try:
  import torchtext
except ImportError:
  !pip install torchtext
  import torchtext

### Getting the dataset from Hugging Face using the `datasets` library
Load the IMDb dataset and split it into `train_data` and `test_data`.



In [14]:
train_data, test_data = datasets.load_dataset("imdb", split=["train", "test"])

In [16]:
train_data,test_data

(Dataset({
     features: ['text', 'label'],
     num_rows: 25000
 }),
 0)

In [17]:
train_data[0],test_data[0]

({'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far b

In [18]:
train_data.features

{'text': Value('string'), 'label': ClassLabel(names=['neg', 'pos'])}

## Tokenization

Machine learning models cannot operate on raw strings, so we split each review into tokens and later assign every token a unique integer that the model can work with.
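
To make the idea concrete, here is a minimal sketch of the two steps (splitting text, then mapping tokens to integers) in plain Python. It is only an illustration: `simple_tokenize` and `build_vocab` are hypothetical helpers, not part of `torchtext`, and the `<unk>` entry for unknown tokens is an assumption about how the vocabulary might be set up.

```python
import collections


def simple_tokenize(text):
    # Hypothetical stand-in tokenizer: lowercase, then split on whitespace
    return text.lower().split()


def build_vocab(token_lists):
    # Count every token, then give each distinct token its own integer id
    counter = collections.Counter(
        token for tokens in token_lists for token in tokens
    )
    vocab = {"<unk>": 0}  # assumed id for tokens never seen while building
    for token, _count in counter.most_common():
        vocab[token] = len(vocab)
    return vocab


texts = ["a great movie", "a dull movie"]
token_lists = [simple_tokenize(text) for text in texts]
vocab = build_vocab(token_lists)

# Replace each token with its id; unseen tokens fall back to <unk>
ids = [
    [vocab.get(token, vocab["<unk>"]) for token in tokens]
    for tokens in token_lists
]
print(token_lists)
print(ids)
```

The same token always receives the same id, so both example sentences share the ids for `a` and `movie`.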


In [20]:
tokenizer = torchtext.data.utils.get_tokenizer("basic_english")

In [24]:
tokenizer("Hello guys!We will be building a ml model today")

['hello',
 'guys',
 '!',
 'we',
 'will',
 'be',
 'building',
 'a',
 'ml',
 'model',
 'today']

Next, we add a new column holding the tokens for the text in each row.

We also limit each review to a `max_length` of a few hundred tokens. Sentiment can usually be predicted well from the first couple of hundred tokens, so truncating very long reviews removes little useful signal.


In [27]:
# Takes a single example (one row of the dataset) and returns its truncated tokens as a dict

def tokenize_example(example,tokenizer,max_length):
  tokens = tokenizer(example["text"])[:max_length]
  return {"tokens":tokens}
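
As a quick sanity check, the function can be tried on a single hand-written row. This sketch repeats the definition so that it runs on its own, and it swaps in a hypothetical `whitespace_tokenizer` for the `basic_english` tokenizer, so the tokens may differ slightly from what `torchtext` would produce (for example around punctuation).

```python
def tokenize_example(example, tokenizer, max_length):
    # Same logic as the cell above: tokenize, then keep the first max_length tokens
    tokens = tokenizer(example["text"])[:max_length]
    return {"tokens": tokens}


def whitespace_tokenizer(text):
    # Hypothetical stand-in for the torchtext tokenizer
    return text.lower().split()


example = {"text": "This film was far better than I expected", "label": 1}
result = tokenize_example(example, whitespace_tokenizer, max_length=5)
print(result)  # {'tokens': ['this', 'film', 'was', 'far', 'better']}
```

Only the first five of the eight tokens survive, which is the truncation that `max_length` is meant to provide.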

We use the `map` method of the `Dataset` class from the `datasets` library to add the `tokens` column to `train_data` and `test_data`.


In [28]:
# Any arguments to the function other than `example` must be passed through the fn_kwargs dictionary
max_length = 256

train_data = train_data.map(
    tokenize_example, fn_kwargs={"tokenizer":tokenizer,"max_length" : max_length}
)

test_data = test_data.map(
    tokenize_example, fn_kwargs={"tokenizer":tokenizer,"max_length" : max_length}
)
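
A rough mental model of what `map` does with `fn_kwargs`: for each row it calls the function as `function(example, **fn_kwargs)` and merges the returned dict into the row as new columns. The sketch below imitates that behaviour on a plain list of dicts. `simple_map` is a hypothetical helper written for illustration, not the `datasets` implementation, and `str.split` stands in for the real tokenizer.

```python
def simple_map(rows, function, fn_kwargs=None):
    # Hypothetical imitation of Dataset.map for a list of dict rows
    fn_kwargs = fn_kwargs or {}
    mapped = []
    for row in rows:
        new_columns = function(row, **fn_kwargs)  # extra arguments arrive here
        mapped.append({**row, **new_columns})     # keep old columns, add new ones
    return mapped


def tokenize_example(example, tokenizer, max_length):
    tokens = tokenizer(example["text"])[:max_length]
    return {"tokens": tokens}


rows = [
    {"text": "a b c d", "label": 0},
    {"text": "e f", "label": 1},
]
mapped = simple_map(
    rows,
    tokenize_example,
    fn_kwargs={"tokenizer": str.split, "max_length": 3},
)
for row in mapped:
    print(row)
```

Each row keeps its original `text` and `label` and gains a `tokens` column, which matches the features listed for `train_data` after the real `map` call.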


In [29]:
train_data,train_data.features

(Dataset({
     features: ['text', 'label', 'tokens'],
     num_rows: 25000
 }),
 {'text': Value('string'),
  'label': ClassLabel(names=['neg', 'pos']),
  'tokens': List(Value('string'))})

In [35]:
train_data[0]['tokens'][:10]

['i',
 'rented',
 'i',
 'am',
 'curious-yellow',
 'from',
 'my',
 'video',
 'store',
 'because']