# Text Classification with Neural Networks
Build machine learning models to detect the sentiment of movie reviews using the IMDb movie reviews dataset. Specifically,  implement classifiers based on Convolutional Neural Networks (CNN's) and Recurrent Neural Networks (RNN's).

Device:

In [1]:
import torch

DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

if __name__=='__main__':
    print('Using device:', DEVICE)

Using device: cuda


---
# Step 1: Download the Data
First I will download the dataset using [torchtext](https://torchtext.readthedocs.io/en/latest/index.html), which is a package that supports NLP for PyTorch. 

In [2]:
!pip install torchdata

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting torchdata
  Downloading torchdata-0.4.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.4 MB)
[K     |████████████████████████████████| 4.4 MB 33.2 MB/s 
Collecting portalocker>=2.0.0
  Downloading portalocker-2.5.1-py2.py3-none-any.whl (15 kB)
Collecting urllib3>=1.25
  Downloading urllib3-1.26.12-py2.py3-none-any.whl (140 kB)
[K     |████████████████████████████████| 140 kB 69.2 MB/s 
  Downloading urllib3-1.25.11-py2.py3-none-any.whl (127 kB)
[K     |████████████████████████████████| 127 kB 72.1 MB/s 
Installing collected packages: urllib3, portalocker, torchdata
  Attempting uninstall: urllib3
    Found existing installation: urllib3 1.24.3
    Uninstalling urllib3-1.24.3:
      Successfully uninstalled urllib3-1.24.3
Successfully installed portalocker-2.5.1 torchdata-0.4.1 urllib3-1.25.11


Get `train_data` and `test_data` with some basic tokenization:

In [3]:
import torchtext
import random

def preprocess(review):
    res = []
    for x in review.split(' '):
        remove_beg = x[0] in {'(', '"', "'"}
        remove_end = x[-1] in {'.', ',', ';', ':', '?', '!', '"', "'", ')'}
        
        if remove_beg and remove_end: 
            res += [x[0], x[1:-1], x[-1]]
        elif remove_beg:
            res += [x[0], x[1:]]
        elif remove_end: 
            res += [x[:-1], x[-1]]
        else: res += [x]
    return res

if __name__=='__main__':
    train_data = torchtext.datasets.IMDB(root='.data', split='train')
    train_data = list(train_data)
    train_data = [(x[0], preprocess(x[1])) for x in train_data]
    train_data, test_data = train_data[0:10000] + train_data[12500:12500+10000], train_data[10000:12500] + train_data[12500+10000:], 

    print('Num. Train Examples:', len(train_data))
    print('Num. Test Examples:', len(test_data))

    print("\nSAMPLE DATA:")
    for x in random.sample(train_data, 5):
        print('Sample text:', x[1])
        print('Sample label:', x[0], '\n')

Num. Train Examples: 20000
Num. Test Examples: 5000

SAMPLE DATA:
Sample text: ['A', 'remarkable', 'documentary', 'about', 'the', 'landmark', 'achievements', 'of', 'the', 'Women', 'Lawyers', 'Association', '(', 'WLA', ')', 'of', 'Kumba', ',', 'in', 'southwest', 'Cameroon', ',', 'in', 'legally', 'safeguarding', 'the', 'rights', 'of', 'women', 'and', 'children', 'from', 'acts', 'of', 'domestic', 'violence', '.', 'In', 'this', 'Muslim', 'culture', ',', 'where', 'men', 'have', 'always', 'been', 'sovereign', 'over', 'women', ',', 'according', 'to', 'Sharia', 'law', ',', 'one', 'can', 'well', 'imagine', 'the', 'difficulty', 'of', 'imposing', 'secular', 'legal', 'rights', 'for', 'women', 'and', 'children', '.', 'After', '17', 'years', 'of', 'failed', 'efforts', ',', 'leaders', 'of', 'the', 'WLA', 'began', 'recently', 'to', 'score', 'a', 'few', 'wins', ',', 'and', 'the', 'purpose', 'of', 'this', 'film', 'is', 'to', 'share', 'these', 'victorious', 'stories.<br', '/><br', '/>The', 'leaders', 'of

---
# Step 2: Create Dataloader

## Define the Dataset Class

In the following cell, I will define the <b>dataset</b> class. The dataset contains the tokenized data for your model. 

Functions:

*   <b>` build_dictionary(self)`:</b> Creates the dictionaries `idx2word` and `word2idx`.

* <b>`convert_text(self)`:</b> Converts each review in the dataset to a list of indices, given by `word2idx` dictionary.

*   <b>` get_text(self, idx) `:</b> Return the review at `idx` in the dataset as an array of indices corresponding to the words in the review.

*   <b>`get_label(self, idx) `</b>: Return the value `1` if the label for `idx` in the dataset is `positive`, and should return `0` if it is `negative`.

*  <b> ` __len__(self) `:</b> Return the total number of reviews in the dataset as an `int`.

*   <b>` __getitem__(self, idx)`:</b> Return the (padded) text, and the label.

In [4]:
PAD = '<PAD>'
END = '<END>'
UNK = '<UNK>'

In [5]:
from torch.utils import data
from collections import defaultdict


class TextDataset(data.Dataset):
    def __init__(self, examples, split, threshold, max_len, idx2word=None, word2idx=None):
        self.examples = examples
        self.split = split
        self.threshold = threshold
        self.max_len = max_len

        # Dictionaries
        self.idx2word = idx2word
        self.word2idx = word2idx
        if split == 'train':
            self.build_dictionary()
        self.vocab_size = len(self.word2idx)
        
        # Convert text to indices
        self.textual_ids = []
        self.convert_text()

    
    def build_dictionary(self): 
        assert self.split == 'train'
        
        self.idx2word = {0:PAD, 1:END, 2: UNK}
        self.word2idx = {PAD:0, END:1, UNK: 2}
        
        count = {}
        idx = 3
        
        for example in self.examples:
            for word in example[1]:
                word = word.lower()
                if word in count.keys():
                    count[word] += 1
                else:
                    count[word] = 1
                    
        for word in count.keys():
            if count[word] >= self.threshold:
                self.idx2word[idx] = word
                self.word2idx[word] = idx
                idx += 1
        pass
    
    def convert_text(self):
        for example in self.examples:
            example_ids = []
            for word in example[1]:
                if word not in self.word2idx.keys():
                    word = UNK
                    
                example_ids.append(self.word2idx[word])
            example_ids.append(self.word2idx[END])
            self.textual_ids.append([example[0], example_ids])
        pass

    def get_text(self, idx):
        _, ids = self.textual_ids[idx]
        if len(ids) > self.max_len:
            return torch.LongTensor(ids[:self.max_len])
        
        for _ in range(self.max_len - len(ids)):
            ids.append(self.word2idx[PAD])
            
        return torch.LongTensor(ids)
    
    def get_label(self, idx):
        label_map = {'pos': 1, 'neg': 0}
        return torch.tensor(label_map[self.textual_ids[idx][0]])

    def __len__(self):
        return len(self.textual_ids)
    
    def __getitem__(self, idx):
        return self.get_text(idx), self.get_label(idx)

The following cell builds the dataset on the IMDb movie reviews and prints an example:

In [6]:
train_dataset = TextDataset(train_data, 'train', threshold=10, max_len=150)
print('Vocab size:', train_dataset.vocab_size, '\n')

randidx = random.randint(0, len(train_dataset)-1)
text, label = train_dataset[randidx]
print('Example text:')
print(train_data[randidx][1])
print(text)
print('\nExample label:')
print(train_data[randidx][0])
print(label)

Vocab size: 19002 

Example text:
['You', 'know', "you're", 'in', 'for', 'something', 'different', 'when', 'a', 'movie', 'has', 'Christopher', 'Walken', 'playing', 'the', 'part', 'of', 'a', 'professional', 'hit', 'man', '-', 'and', 'he', "isn't", 'even', 'one', 'of', 'the', 'bad', 'guys', '!', 'Although', 'it', 'could', 'do', 'with', 'some', 'judicious', 'trimming', 'here', 'and', 'there', ',', '"', 'Man', 'on', 'Fire', '"', 'is', 'a', 'generally', 'effective', 'crime', 'drama', 'that', 'ranges', 'in', 'tone', 'from', 'the', 'openly', 'sentimental', 'to', 'the', 'downright', 'brutal', '-', 'and', 'just', 'about', 'every', 'tone', 'imaginable', 'in', 'between.<br', '/><br', '/>Denzel', 'Washington', 'stars', 'as', 'Creasy', ',', 'a', 'former', 'CIA', 'assassin', 'who', 'has', 'recently', 'quit', 'the', 'business', 'and', 'is', 'seeking', 'some', 'sort', 'of', 'redemption', 'for', 'the', 'sins', "he's", 'committed', '.', 'So', 'far', ',', "he's", 'been', 'looking', 'for', 'answers', 'in'

---
# Step 3: Train a Convolutional Neural Network (CNN)

## Define the CNN Model
Here I define my convolutional neural network for text classification.

In [7]:
import torch
import torch.nn as nn
import torch.nn.functional as F


class CNN(nn.Module):
    def __init__(self, vocab_size, embed_size, out_channels, filter_heights, stride, dropout, num_classes, pad_idx):
        super(CNN, self).__init__()
      
        self.embedding = nn.Embedding(vocab_size, embed_size, pad_idx)
        self.conv2ds = nn.ModuleList([nn.Conv2d(1, out_channels, kernel_size=(filter_height, embed_size), stride=stride) for filter_height in filter_heights])
        self.dropout = nn.Dropout(p=dropout)
        self.linear = nn.Linear(out_channels * len(self.conv2ds), num_classes)
        self.relu = nn.ReLU()
        

    def forward(self, texts):
        embeddings = self.embedding(texts)

        embeddings = torch.unsqueeze(embeddings, 1)

        outputs = []
        for conv2d in self.conv2ds:
            out = conv2d(embeddings)
            out = torch.squeeze(out, 3)
            out = self.relu(out)
            out = torch.amax(out, dim=2)
            outputs.append(out)
        out = torch.cat(outputs, 1)
        
        out = self.dropout(out)

        out = self.linear(out)

        return out

## Train CNN Model

First, we initialize the train and test <b>dataloaders</b>. A dataloader is responsible for providing batches of data to the model.

In [8]:
THRESHOLD = 5
MAX_LEN = 200
BATCH_SIZE = 32

train_dataset = TextDataset(train_data, 'train', THRESHOLD, MAX_LEN)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=2, drop_last=True)

test_dataset = TextDataset(test_data, 'test', THRESHOLD, MAX_LEN, train_dataset.idx2word, train_dataset.word2idx)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=1, shuffle=False, num_workers=1, drop_last=False)

In [9]:
from tqdm.notebook import tqdm


def train_model(model, num_epochs, data_loader, optimizer, criterion):
    print('Training Model...')
    model.train()
    
    for epoch in tqdm(range(num_epochs)):
        epoch_loss = 0
        epoch_acc = 0
        
        for texts, labels in data_loader:
            texts = texts.to(DEVICE)
            labels = labels.to(DEVICE)

            optimizer.zero_grad()

            output = model(texts)
            acc = accuracy(output, labels)
            
            loss = criterion(output, labels)
            loss.backward()
            optimizer.step()

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        print('[TRAIN]\t Epoch: {:2d}\t Loss: {:.4f}\t Train Accuracy: {:.2f}%'.format(epoch+1, epoch_loss/len(data_loader), 100*epoch_acc/len(data_loader)))
    print('Model Trained!\n')

Here are some other helper functions:

In [10]:
def accuracy(output, labels):
    preds = output.argmax(dim=1) # find predicted class
    correct = (preds == labels).sum().float() # convert into float for division 
    acc = correct / len(labels)
    return acc

Now we can instantiate the model:

In [11]:
cnn_model = CNN(vocab_size = train_dataset.vocab_size,
            embed_size = 128, 
            out_channels = 64, 
            filter_heights = [2, 3, 4], 
            stride = 1, 
            dropout = 0.5, 
            num_classes = 2,
            pad_idx = train_dataset.word2idx[PAD])

# Put your model on the device (cuda or cpu)
cnn_model = cnn_model.to(DEVICE)

count_parameters = lambda model: sum(p.numel() for p in model.parameters() if p.requires_grad)

print('The model has {:,d} trainable parameters'.format(count_parameters(cnn_model)))

The model has 3,879,746 trainable parameters


Next, I create the **criterion**, which is our loss function: it is a measure of how well the model matches the empirical distribution of the data. I use cross-entropy loss (https://en.wikipedia.org/wiki/Cross_entropy).

I also define the **optimizer**, which performs gradient descent. I use the Adam optimizer (https://arxiv.org/pdf/1412.6980.pdf), which has been shown to work well on these types of models.

In [12]:
import torch.optim as optim


LEARNING_RATE = 5e-5

criterion = nn.CrossEntropyLoss().to(DEVICE)

optimizer = optim.Adam(cnn_model.parameters(), lr=LEARNING_RATE)

Finally, we can train the model:

In [13]:
N_EPOCHS = 20 

train_model(cnn_model, N_EPOCHS, train_loader, optimizer, criterion)

Training Model...


  0%|          | 0/20 [00:00<?, ?it/s]

[TRAIN]	 Epoch:  1	 Loss: 0.8042	 Train Accuracy: 51.77%
[TRAIN]	 Epoch:  2	 Loss: 0.7418	 Train Accuracy: 54.34%
[TRAIN]	 Epoch:  3	 Loss: 0.6919	 Train Accuracy: 57.98%
[TRAIN]	 Epoch:  4	 Loss: 0.6607	 Train Accuracy: 61.27%
[TRAIN]	 Epoch:  5	 Loss: 0.6308	 Train Accuracy: 64.17%
[TRAIN]	 Epoch:  6	 Loss: 0.6082	 Train Accuracy: 66.86%
[TRAIN]	 Epoch:  7	 Loss: 0.5900	 Train Accuracy: 67.91%
[TRAIN]	 Epoch:  8	 Loss: 0.5757	 Train Accuracy: 69.58%
[TRAIN]	 Epoch:  9	 Loss: 0.5600	 Train Accuracy: 71.25%
[TRAIN]	 Epoch: 10	 Loss: 0.5447	 Train Accuracy: 71.94%
[TRAIN]	 Epoch: 11	 Loss: 0.5366	 Train Accuracy: 72.97%
[TRAIN]	 Epoch: 12	 Loss: 0.5256	 Train Accuracy: 73.35%
[TRAIN]	 Epoch: 13	 Loss: 0.5154	 Train Accuracy: 74.60%
[TRAIN]	 Epoch: 14	 Loss: 0.5043	 Train Accuracy: 75.12%
[TRAIN]	 Epoch: 15	 Loss: 0.4988	 Train Accuracy: 75.55%
[TRAIN]	 Epoch: 16	 Loss: 0.4916	 Train Accuracy: 75.77%
[TRAIN]	 Epoch: 17	 Loss: 0.4853	 Train Accuracy: 76.56%
[TRAIN]	 Epoch: 18	 Loss: 0.477

## Evaluate CNN Model 

Now that we have trained a model for text classification, it is time to evaluate it.

In [14]:
import random

 
def evaluate(model, data_loader, criterion, use_tqdm=False):
    print('Evaluating performance on the test dataset...')
    model.eval()
    epoch_loss = 0
    epoch_acc = 0
    all_predictions = []
    print("\nSOME PREDICTIONS FROM THE MODEL:")
    iterator = tqdm(data_loader) if use_tqdm else data_loader
    total = 0
    
    for texts, labels in iterator:
        bs = texts.shape[0]
        total += bs
        texts = texts.to(DEVICE)
        labels = labels.to(DEVICE)
        
        output = model(texts)
        acc = accuracy(output, labels) * len(labels)
        pred = output.argmax(dim=1)
        all_predictions.append(pred)
        
        loss = criterion(output, labels) * len(labels)
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()

        if random.random() < 0.0015 and bs == 1:
            print("Input: "+' '.join([data_loader.dataset.idx2word[idx] for idx in texts[0].tolist() if idx not in {data_loader.dataset.word2idx[PAD], data_loader.dataset.word2idx[END]}]))
            print("Prediction:", pred.item(), '\tCorrect Output:', labels.item(), '\n')

    full_acc = 100*epoch_acc/total
    full_loss = epoch_loss/total
    print('[TEST]\t Loss: {:.4f}\t Accuracy: {:.2f}%'.format(full_loss, full_acc))
    predictions = torch.cat(all_predictions)
    return predictions, full_acc, full_loss

In [15]:
evaluate(cnn_model, test_loader, criterion, use_tqdm=True) # Compute test data accuracy

Evaluating performance on the test dataset...

SOME PREDICTIONS FROM THE MODEL:


  0%|          | 0/5000 [00:00<?, ?it/s]

Input: <UNK> about <UNK> <UNK> <UNK> double <UNK> operation which is believed to have sparked off his obsession with the inner workings of the human body ? <UNK> about " infinity <UNK> ? - <UNK> game he invented as a child which involved stick men being annihilated when they came too close to one another , suggesting that intimacy was the ultimate danger . <UNK> about the relationship between his parents , and the emotional problems of his mother that were far more relevant than just his own relationship with his father ? <UNK> feelings of neglect when his brother was born ? <UNK> about his fascination with insects and animals ? <UNK> he would dissect roadkill and hang it up in the woods behind his <UNK> about focusing more on his cannibalism ? <UNK> what about his parent's divorce ? <UNK> are all things that should have been included in the film . <UNK> the film maker chose to give us a watered down ' snapshot ' from a night or two in his life , and combine it with series of confusing

(tensor([0, 0, 0,  ..., 0, 1, 1], device='cuda:0'), 79.16, 0.44462695379832295)

---
# Step 4: Train a Recurrent Neural Network (RNN)
I will now build a text clasification model that is based on **recurrences**.

## Define the RNN Model

First, I will define the RNN:

In [16]:
class RNN(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers, bidirectional, dropout, num_classes, pad_idx):
        super(RNN, self).__init__()
        
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.bidirectional = bidirectional

        self.embedding = nn.Embedding(vocab_size, embed_size, pad_idx)
        self.gru = nn.GRU(embed_size, hidden_size, num_layers, batch_first=True, dropout=dropout, bidirectional=bidirectional)
        self.dropout = nn.Dropout(p=dropout)

        if bidirectional:
            self.linear = nn.Linear(hidden_size*2, num_classes)
        else:
            self.linear = nn.Linear(hidden_size, num_classes)


    def forward(self, texts):
        D = 2 if self.bidirectional else 1
        batch_size, max_len = list(texts.size())
        
        embeddings = self.embedding(texts)
        
        h0 = torch.zeros(self.num_layers * D, batch_size, self.hidden_size, device=texts.device)
        out, hid = self.gru(embeddings, h0)
        
        if self.bidirectional:
            out = torch.cat((hid[-2,:,:], hid[-1,:,:]), dim = 1)
        else:
            out = hid[-1,:,:]

        out = self.dropout(out)
        out = self.linear(out)

        return out

## Train RNN Model
First, we initialize the train and test dataloaders:

In [17]:
THRESHOLD = 5 
MAX_LEN = 200 
BATCH_SIZE = 32 

train_dataset = TextDataset(train_data, 'train', THRESHOLD, MAX_LEN)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=2, drop_last=True)

test_dataset = TextDataset(test_data, 'test', THRESHOLD, MAX_LEN, train_dataset.idx2word, train_dataset.word2idx)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=1, shuffle=False, num_workers=1, drop_last=False)

Now I can instantiate the model:

In [18]:
rnn_model = RNN(vocab_size = train_dataset.vocab_size,
            embed_size = 128, 
            hidden_size = 128, 
            num_layers = 2,
            bidirectional = True,
            dropout = 0.5,
            num_classes = 2,
            pad_idx = train_dataset.word2idx[PAD])

rnn_model = rnn_model.to(DEVICE)

print('The model has {:,d} trainable parameters'.format(count_parameters(rnn_model)))

The model has 4,300,546 trainable parameters


Here, we create the criterion and optimizer; as with the CNN, I use cross-entropy loss and Adam optimization:

In [19]:
LEARNING_RATE = 5e-4 

criterion = nn.CrossEntropyLoss().to(DEVICE)

optimizer = optim.Adam(rnn_model.parameters(), lr=LEARNING_RATE)

Finally, we can train the model. I use the same `train_model(...)` function that I defined for the CNN:

In [20]:
N_EPOCHS = 12 

train_model(rnn_model, N_EPOCHS, train_loader, optimizer, criterion)

Training Model...


  0%|          | 0/12 [00:00<?, ?it/s]

[TRAIN]	 Epoch:  1	 Loss: 0.6665	 Train Accuracy: 58.33%
[TRAIN]	 Epoch:  2	 Loss: 0.5243	 Train Accuracy: 74.38%
[TRAIN]	 Epoch:  3	 Loss: 0.3746	 Train Accuracy: 83.86%
[TRAIN]	 Epoch:  4	 Loss: 0.2839	 Train Accuracy: 88.52%
[TRAIN]	 Epoch:  5	 Loss: 0.2201	 Train Accuracy: 91.66%
[TRAIN]	 Epoch:  6	 Loss: 0.1583	 Train Accuracy: 94.33%
[TRAIN]	 Epoch:  7	 Loss: 0.1053	 Train Accuracy: 96.30%
[TRAIN]	 Epoch:  8	 Loss: 0.0656	 Train Accuracy: 97.89%
[TRAIN]	 Epoch:  9	 Loss: 0.0498	 Train Accuracy: 98.42%
[TRAIN]	 Epoch: 10	 Loss: 0.0370	 Train Accuracy: 98.84%
[TRAIN]	 Epoch: 11	 Loss: 0.0266	 Train Accuracy: 99.12%
[TRAIN]	 Epoch: 12	 Loss: 0.0285	 Train Accuracy: 99.04%
Model Trained!



## Evaluate RNN Model
Now we can evaluate the RNN.

In [21]:
evaluate(rnn_model, test_loader, criterion, use_tqdm=True) # Compute test data accuracy

Evaluating performance on the test dataset...

SOME PREDICTIONS FROM THE MODEL:


  0%|          | 0/5000 [00:00<?, ?it/s]

Input: <UNK> was fascinated as to how truly bad this movie was . <UNK> the viewer supposed to learn something , or reflect on anything here ? <UNK> was up with the <UNK> ? <UNK> <UNK> supposed to be impressed with the motel shots ? <UNK> it matter that there are some garbage bags on a rooftop across the street of a hotel ? <UNK> does the narrator unsuccessfully mock the people he interviews ( it is so obvious that he edited out the really informative parts of his interviews to achieve <UNK> . <UNK> best part of the movie was the interview with the film professor who tells us how bad this movie will be even before it is <UNK> /><br <UNK> am truly amazed . <UNK> believe that the creator is struggling to become an intellectual or is trying to impress the intellectual community .
Prediction: 0 	Correct Output: 0 

Input: <UNK> main comment on this movie is how <UNK> was able to get credible actors to work on this movie ? <UNK> cast  even for the supporting characters , none of which helps

(tensor([0, 0, 0,  ..., 1, 0, 1], device='cuda:0'), 81.58, 0.9718203727527216)