In this work, I use the torchtext library to preprocess and model IMDB reviews. Preprocessing includes tokenization, stemming, and stop-word removal. Additionally, pretrained GloVe word vectors are used to improve performance. Finally, once the word vectors are obtained, a neural-network-based binary classifier is built.

In [3]:
import torch
print(torch.__version__)  # PyTorch version before the upgrade below
import torch.nn.functional as F

!pip install -U torch==1.8.0 torchtext==0.9.0
import torchtext.legacy as torchtext
import random

import nltk
from torchtext.legacy.data import BucketIterator

nltk.download('punkt')      # tokenizer models used by nltk.word_tokenize
nltk.download("stopwords")  # English stop-word list
from nltk.stem import SnowballStemmer
from nltk.corpus import stopwords

1.7.0
Collecting torch==1.8.0
  Downloading torch-1.8.0-cp37-cp37m-manylinux1_x86_64.whl (735.5 MB)

**Preprocessing and Data Preparation**

Below, a stemmer, a stop-word list, and a tokenizer are defined. These preprocessing steps are then applied through a torchtext.data.Field object.

In [5]:
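stemmer = SnowballStemmer("english")
stop_words = set(stopwords.words("english"))  # set for fast membership tests

def tokenize_stem_stop(text):
  """Tokenize a review, stem each token, then drop stop words."""
  nltk_tokens = nltk.word_tokenize(text)
  stems = [stemmer.stem(token) for token in nltk_tokens]
  # Filtering runs on the stems, so a stop word whose stem differs from its
  # surface form (e.g. "very" -> "veri") is not removed.
  return [w for w in stems if w not in stop_words]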
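
The sample output further down contains stems such as 'veri' and 'onli', which suggests that the order of stemming and stop-word filtering matters. The following is a minimal sketch of that effect. `toy_stem` and `TOY_STOP_WORDS` are hypothetical stand-ins invented for illustration; they are not the NLTK SnowballStemmer or the NLTK stop-word list.

```python
# Toy illustration only: toy_stem and TOY_STOP_WORDS are hypothetical
# stand-ins, not the NLTK stemmer or stop-word list used in the notebook.
TOY_STOP_WORDS = {"very", "only", "the", "is"}

def toy_stem(token):
    """Crude stemmer: lowercase, and rewrite a trailing 'y' as 'i'."""
    token = token.lower()
    if token.endswith("y") and len(token) > 2:
        return token[:-1] + "i"
    return token

def stem_then_filter(tokens):
    """Order used in the notebook: stem first, then drop stop words."""
    stems = [toy_stem(t) for t in tokens]
    return [s for s in stems if s not in TOY_STOP_WORDS]

def filter_then_stem(tokens):
    """Alternative order: drop stop words on the surface form, then stem."""
    kept = [t for t in tokens if t.lower() not in TOY_STOP_WORDS]
    return [toy_stem(t) for t in kept]

tokens = ["The", "story", "is", "very", "sad"]
print(stem_then_filter(tokens))  # 'veri' survives the filter
print(filter_then_stem(tokens))  # 'very' is removed before stemming
```

Whether the alternative order helps the classifier is an empirical question that this sketch does not answer.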
  

In [6]:
TEXT = torchtext.data.Field(tokenize=tokenize_stem_stop, batch_first=True)  # preprocessing parameters can be used to add additional preprocessing steps
LABEL = torchtext.data.LabelField(dtype = torch.float)

The dataset is downloaded through torchtext.datasets, which provides many large text corpora for various applications.

In [7]:
train_data, test_data = torchtext.datasets.IMDB.splits(TEXT, LABEL)

aclImdb_v1.tar.gz:   0%|          | 0.00/84.1M [00:00<?, ?B/s]

downloading aclImdb_v1.tar.gz


aclImdb_v1.tar.gz: 100%|██████████| 84.1M/84.1M [00:01<00:00, 45.1MB/s]


Below, the maximum review length is found in order to define a fixed number of words per review.

In [8]:
max_size = 0  # longest review in the training set, in tokens
total = 0     # running sum of review lengths (avoids shadowing the built-in sum)
for i in range(len(train_data)):
  length = len(train_data[i].text)
  if max_size < length:
    max_size = length
    print(max_size)
  total += length
print("average: ", total / len(train_data))

218
653
749
1362
1437
1799
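
Padding every review to the maximum length means that most positions are padding whenever a few reviews are much longer than the rest. The sketch below compares the maximum against a percentile-based cut-off. The helper names and the list of lengths are hypothetical; the numbers were not measured on the IMDB data.

```python
import math

def percentile_length(lengths, q):
    """Nearest-rank percentile: smallest length covering a fraction q of reviews."""
    ordered = sorted(lengths)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]

def padding_fraction(lengths, fix_length):
    """Share of positions that are padding after truncating/padding to fix_length."""
    used = sum(min(n, fix_length) for n in lengths)
    return 1 - used / (fix_length * len(lengths))

# Hypothetical token counts for ten reviews (illustrative only).
lengths = [120, 95, 218, 150, 300, 80, 1799, 210, 175, 130]

max_len = max(lengths)
p90_len = percentile_length(lengths, 0.9)
print(max_len, round(padding_fraction(lengths, max_len), 3))
print(p90_len, round(padding_fraction(lengths, p90_len), 3))
```

A shorter fixed length truncates the longest reviews, so it trades some information for a much smaller input layer.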


In [9]:
# fix_length pads or truncates every review to the longest training review (1799 tokens)
TEXT = torchtext.data.Field(tokenize=tokenize_stem_stop, batch_first=True, fix_length=1799)
LABEL = torchtext.data.LabelField(dtype=torch.float)
train_data, test_data = torchtext.datasets.IMDB.splits(TEXT, LABEL)
print("train length is: ",len(train_data))
print("test length is: ",len(test_data))
print(vars(train_data[0]))

train length is:  25000
test length is:  25000
{'text': ['feel', 'film', 'rare', 'treasur', '.', 'onli', 'begin', 'shirley', 'templ', "'s", 'career', ',', 'rare', 'look', 'societi', 'chang', '.', 'understand', ',', 'certain', 'thing', 'today', 'would', 'view', 'sexual', ',', 'back', 'would', 'consid', 'innoc', '.', 'exampl', ',', 'parent', 'children', 'film', 'well', 'mani', 'parent', 'took', 'children', 'see', 'movi', ',', 'saw', 'children', 'mimick', 'adult', '.', 'peopl', "n't", 'think', 'anyon', 'view', 'children', 'sexual', 'attract', ',', 'teenag', 'boy', 'lust', 'teenag', 'girl', '.', "n't", 'sexual', '.', 'mind', 'befor', 'internet', ',', 'tv', ',', 'etc', '...', 'sex', 'crime', "n't", 'open', 'brought', '.', 'occasion', 'would', 'whisper', 'kid', '``', 'funni', 'uncl', '.', "''", 'often', 'came', '.', 'yes', 'veri', 'sad', '.', 'kinda', 'sad', 'today', ',', 'even', 'see', 'film', 'anyth', 'intend', ',', 'innoc', 'funni', '.', 'saw', 'shirley', 'danc', 'like', 'boy', 'eye', 'ba

An example of a preprocessed movie review is shown above. Lemmatization is a more sophisticated approach than stemming; however, it is more costly.

In [10]:
TEXT.build_vocab(train_data, max_size = 20000,
                 vectors="glove.6B.100d",
                 unk_init = torch.Tensor.normal_)
LABEL.build_vocab(train_data)
print("Unique tokens in TEXT vocabulary:",len(TEXT.vocab))
print("Unique tokens in LABEL vocabulary:",len(LABEL.vocab))

.vector_cache/glove.6B.zip: 862MB [02:40, 5.37MB/s]                               
100%|█████████▉| 399999/400000 [00:20<00:00, 19646.53it/s]


Unique tokens in TEXT vocabulary: 20002
Unique tokens in LABEL vocabulary: 2
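
The TEXT vocabulary has 20002 entries rather than 20000 because the two special tokens, `<unk>` and `<pad>`, are added on top of the `max_size` most frequent tokens. The following is a simplified sketch of that bookkeeping in plain Python. `build_vocab` and `numericalize` are hypothetical helpers written for illustration, not torchtext functions.

```python
from collections import Counter

def build_vocab(token_lists, max_size, specials=("<unk>", "<pad>")):
    """Keep the max_size most frequent tokens and prepend the special tokens."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    itos = list(specials) + [tok for tok, _ in counts.most_common(max_size)]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi, fix_length):
    """Map tokens to ids, truncate to fix_length, then pad with the <pad> id."""
    unk, pad = stoi["<unk>"], stoi["<pad>"]
    ids = [stoi.get(t, unk) for t in tokens][:fix_length]
    return ids + [pad] * (fix_length - len(ids))

docs = [["good", "movi", "good"], ["bad", "movi"], ["good", "film", "rare"]]
itos, stoi = build_vocab(docs, max_size=3)
print(len(itos))  # max_size + 2 special tokens
print(numericalize(["good", "unseen", "movi"], stoi, fix_length=5))
```

In this sketch `<pad>` gets index 1, which matches the `padding_idx=1` reported when the model is printed later on.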


In [11]:
print("Unique tokens in TEXT vocabulary:",len(TEXT.vocab))
print("Unique tokens in LABEL vocabulary:",len(LABEL.vocab))
print(TEXT.vocab.freqs.most_common(20))
print(LABEL.vocab.freqs)
print(TEXT.unk_token)
print(TEXT.pad_token)

Unique tokens in TEXT vocabulary: 20002
Unique tokens in LABEL vocabulary: 2
[(',', 275887), ('.', 237172), ('/', 102097), ('>', 102036), ('<', 101971), ('br', 101871), ("'s", 62159), ('movi', 49987), ('film', 46624), (')', 36175), ('(', 35397), ("n't", 33376), ("''", 33138), ('``', 32786), ('one', 26807), ('!', 24560), ('like', 22243), ('?', 16088), ('time', 15225), ('good', 14871)]
Counter({'pos': 12500, 'neg': 12500})
<unk>
<pad>


In [12]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Create train and test iterators with BucketIterator, using a batch size of 64
train_iterator, test_iterator = BucketIterator.splits((train_data,test_data),batch_size=64,device=device)
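
The idea behind bucketing is to batch reviews of similar length together so that less padding is needed. The sketch below is a simplified illustration of that idea, not torchtext's implementation; the helper names and lengths are hypothetical. Note that in this notebook the Field uses fix_length=1799, so every review is padded to 1799 tokens regardless of how the batches are formed.

```python
import random

def bucket_batches(lengths, batch_size, seed=0):
    """Group example indices of similar length into batches, then shuffle the batches."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    random.Random(seed).shuffle(batches)
    return batches

def padded_positions(lengths, batches):
    """Total positions when each batch is padded to its own longest example."""
    return sum(max(lengths[i] for i in batch) * len(batch) for batch in batches)

# Hypothetical review lengths (illustrative only).
lengths = [12, 300, 15, 280, 14, 310, 11, 290]

sequential = [list(range(i, i + 2)) for i in range(0, len(lengths), 2)]
bucketed = bucket_batches(lengths, batch_size=2)
print(padded_positions(lengths, sequential))
print(padded_positions(lengths, bucketed))
```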

**Model**

The key part of the model is the embedding stage. The embedding dimension, padding index, and number of embeddings all follow from the previous steps: the vocabulary has 20002 entries, the GloVe vectors have 100 dimensions, and the padding index is looked up in TEXT.vocab.

Once the word embeddings are generated, they are flattened and fed into a two-layer neural network for binary classification.

In [13]:
class Network(torch.nn.Module):
    def __init__(self,pad_idx):
        super().__init__()
        self.embedding = torch.nn.Embedding(num_embeddings = 20002, embedding_dim =100,padding_idx = pad_idx)
        self.layer1 = torch.nn.Linear(1799*100, 1000)
        self.layer2 = torch.nn.Linear(1000, 1)


    def forward(self, x):
        x = self.embedding(x).view(x.size(0),-1)
        x = self.layer1(x)
        x = F.relu(x)
        x = self.layer2(x)

        return x       

In [14]:
model = Network(pad_idx = TEXT.vocab.stoi[TEXT.pad_token])
print(model)

Network(
  (embedding): Embedding(20002, 100, padding_idx=1)
  (layer1): Linear(in_features=179900, out_features=1000, bias=True)
  (layer2): Linear(in_features=1000, out_features=1, bias=True)
)
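
The printed shapes make it easy to count the parameters by hand. The short calculation below does so; `linear_params` is a hypothetical helper, and the sizes are copied from the printed model.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

vocab_size, embedding_dim, fix_length, hidden = 20002, 100, 1799, 1000

embedding = vocab_size * embedding_dim                     # one 100-d row per token
layer1 = linear_params(fix_length * embedding_dim, hidden)  # flattened review -> 1000
layer2 = linear_params(hidden, 1)                           # 1000 -> single logit
total = embedding + layer1 + layer2

print(f"embedding: {embedding:,}")
print(f"layer1:    {layer1:,}")
print(f"layer2:    {layer2:,}")
print(f"total:     {total:,}")
```

Almost all of the parameters sit in the first linear layer, because its input size is the fixed length (1799) times the embedding dimension (100).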


In [15]:
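# Initialise the embedding layer with the pretrained GloVe vectors
model.embedding.weight.data.copy_(TEXT.vocab.vectors)
# Zero the <unk> and <pad> rows
model.embedding.weight.data[TEXT.vocab.stoi[TEXT.unk_token]] = torch.zeros(100)
model.embedding.weight.data[TEXT.vocab.stoi[TEXT.pad_token]] = torch.zeros(100)
# This sets an attribute on the module, not on its weight tensor, so the embeddings
# stay trainable; freezing needs model.embedding.weight.requires_grad = False
model.embedding.requires_grad = False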

In [16]:
# Binary cross-entropy on raw logits; the network has no final sigmoid
loss_fn = torch.nn.BCEWithLogitsLoss()

# Adam optimizer with learning rate 0.001 over the network's parameters
optimizer = torch.optim.Adam(params=model.parameters(), lr=0.001)
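
BCEWithLogitsLoss combines a sigmoid with binary cross-entropy, which is why the network outputs a raw logit. The sketch below shows the per-example loss in plain Python, using a common numerically stable formulation. It is an illustration of the math, not PyTorch's internal code, and the function names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_naive(logit, target):
    """Textbook form: -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(logit)."""
    p = sigmoid(logit)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

def bce_with_logits(logit, target):
    """Stable form: max(x, 0) - x*y + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

# Both forms agree for moderate logits.
for logit, target in [(2.0, 1.0), (-1.5, 0.0), (0.3, 0.0)]:
    print(round(bce_naive(logit, target), 6), round(bce_with_logits(logit, target), 6))

# The stable form also handles an extreme logit, where the naive form
# would try to evaluate log(0).
print(bce_with_logits(1000.0, 0.0))
```

With the default settings, the loss module averages these per-example values over the batch.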

In [17]:
model = model.to(device)
loss_fn = loss_fn.to(device)

**Training**

Below, helper functions and the training procedure are defined.

In [18]:
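def accuracy_fn(predictions, labels):
  """Fraction of examples whose thresholded sigmoid output matches the label."""
  # Compare against the labels argument rather than the global batch variable
  correct = (torch.round(torch.sigmoid(predictions)) == labels).float()
  acc = correct.sum() / len(correct)
  return acc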

In [19]:
import time
# Training loop
N_EPOCHS = 2

tr_loss = []
model.train()

for epoch in range(N_EPOCHS):
    
    # Calculate training time
    start_time = time.time()

    epoch_loss = 0
    epoch_acc = 0
    batch_no = 0

    
    for batch in train_iterator:
        
        # Reset the gradient to not use them in multiple passes 
        optimizer.zero_grad()
        
        predictions = model(batch.text).squeeze(1)
        loss = loss_fn(predictions, batch.label.squeeze(0))
        
        # Backprop
        loss.backward()
        
        # Optimize the weights
        optimizer.step()
        
        # Record accuracy and loss
        epoch_loss += loss.item()

        acc = accuracy_fn(predictions, batch.label)
        epoch_acc +=acc.item()

        batch_no = batch_no +1
        
        if batch_no%60 == 0:
          print(f'Epoch:  {epoch+1:2} | Batch No: {batch_no} | Loss: {loss.item():.3f} | Accuracy: {acc.item()*100:.2f}%')
    
    train_loss = epoch_loss / len(train_iterator)
        
    
    end_time = time.time()

    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    
    # Save training metrics
    tr_loss.append(train_loss)
        
    print(f'Epoch: {epoch+1:2} | Epoch Time: {elapsed_mins}m {elapsed_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} ')

Epoch:   1 | Batch No: 60 | Loss: 0.695 | Accuracy: 57.81%
Epoch:   1 | Batch No: 120 | Loss: 0.611 | Accuracy: 64.06%
Epoch:   1 | Batch No: 180 | Loss: 0.573 | Accuracy: 65.62%
Epoch:   1 | Batch No: 240 | Loss: 0.423 | Accuracy: 84.38%
Epoch:   1 | Batch No: 300 | Loss: 0.493 | Accuracy: 78.12%
Epoch:   1 | Batch No: 360 | Loss: 0.390 | Accuracy: 79.69%
Epoch:  1 | Epoch Time: 0m 25s
	Train Loss: 0.570 
Epoch:   2 | Batch No: 60 | Loss: 0.230 | Accuracy: 90.62%
Epoch:   2 | Batch No: 120 | Loss: 0.189 | Accuracy: 92.19%
Epoch:   2 | Batch No: 180 | Loss: 0.136 | Accuracy: 95.31%
Epoch:   2 | Batch No: 240 | Loss: 0.229 | Accuracy: 90.62%
Epoch:   2 | Batch No: 300 | Loss: 0.270 | Accuracy: 87.50%
Epoch:   2 | Batch No: 360 | Loss: 0.153 | Accuracy: 92.19%
Epoch:  2 | Epoch Time: 0m 24s
	Train Loss: 0.211 


In [20]:
test_epoch_loss = 0
test_epoch_acc = 0

# Turn on evaluation mode
model.eval()

# No need to backprop in eval
with torch.no_grad():

    for batch in test_iterator:

        test_predictions = model(batch.text).squeeze(1)
        
        test_loss = loss_fn(test_predictions, batch.label)

        test_epoch_loss += test_loss.item()
        
        acc = accuracy_fn(test_predictions,batch.label.squeeze(0))
        test_epoch_acc +=acc.item()

test_loss = test_epoch_loss/len(test_iterator)
test_acc = test_epoch_acc  / len(test_iterator)
print(f'Test Loss: {test_loss:.3f} | | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.453 | | Test Acc: 81.18%
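
The test accuracy above is the mean of the per-batch accuracies. With 25000 reviews and a batch size of 64, the last batch holds only 40 reviews, so this mean can differ slightly from the accuracy over all reviews. The sketch below illustrates the difference; the correct-prediction counts are hypothetical, not taken from the run above.

```python
def mean_of_batch_accuracies(correct_counts, batch_sizes):
    """Average the per-batch accuracies, giving every batch equal weight."""
    per_batch = [c / n for c, n in zip(correct_counts, batch_sizes)]
    return sum(per_batch) / len(per_batch)

def overall_accuracy(correct_counts, batch_sizes):
    """Correct predictions divided by the total number of examples."""
    return sum(correct_counts) / sum(batch_sizes)

# 25000 examples in batches of 64: 390 full batches plus one batch of 40.
batch_sizes = [64] * 390 + [40]
# Hypothetical numbers of correct predictions per batch (illustrative only).
correct_counts = [52] * 390 + [20]

print(sum(batch_sizes))
print(round(mean_of_batch_accuracies(correct_counts, batch_sizes), 5))
print(round(overall_accuracy(correct_counts, batch_sizes), 5))
```

The gap is small here because only one of the 391 batches is short, but summing correct predictions and dividing by the number of examples avoids it entirely.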


After several trials, I find that the pretrained GloVe word vectors improve model performance significantly. Preprocessing steps such as stop-word removal and stemming also improve the overall scores.

In the end, 81.18% is a reasonable test accuracy for classifying positive and negative reviews in the IMDB review dataset.
