# Text Classification with Neural Networks

In this project, we build machine learning models to detect the sentiment of movie reviews using the IMDb movie reviews dataset. Specifically, we implement classifiers based on Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs).

In [None]:
import torch

DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

if __name__=='__main__':
    print('Using device:', DEVICE)

Using device: cuda


# Step 1: Download the Data
First, the dataset is downloaded using [torchtext](https://torchtext.readthedocs.io/en/latest/index.html), a package that provides NLP utilities for PyTorch.

In [None]:
import torchtext
import random

def preprocess(review):
    '''
    Simple preprocessing function: split a review on single spaces and separate
    one leading quote/parenthesis and one trailing punctuation mark from each word.
    '''
    res = []
    for x in review.split(' '):
        if not x:  # consecutive spaces yield empty strings; skip them to avoid an IndexError
            continue
        remove_beg = x[0] in {'(', '"', "'"}
        remove_end = x[-1] in {'.', ',', ';', ':', '?', '!', '"', "'", ')'}
        if remove_beg and remove_end: res += [x[0], x[1:-1], x[-1]]
        elif remove_beg: res += [x[0], x[1:]]
        elif remove_end: res += [x[:-1], x[-1]]
        else: res += [x]
    return res

if __name__=='__main__':
    train_data = torchtext.datasets.IMDB(root='.data', split='train')
    train_data = list(train_data)
    train_data = [(x[0], preprocess(x[1])) for x in train_data]
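    # Examples [0, 10000) and [12500, 22500) form the training set; the remaining 5000 form the test set
    train_data, test_data = (train_data[0:10000] + train_data[12500:12500+10000],
                             train_data[10000:12500] + train_data[12500+10000:])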

    print('Num. Train Examples:', len(train_data))
    print('Num. Test Examples:', len(test_data))

    print("\nSAMPLE DATA:")
    for x in random.sample(train_data, 5):
        print('Sample text:', x[1])
        print('Sample label:', x[0], '\n')
    print(train_data[5:7])    

Num. Train Examples: 20000
Num. Test Examples: 5000

SAMPLE DATA:
Sample text: ['You', 'have', 'to', 'admire', 'Brad', 'Sykes', 'even', 'if', 'you', "don't", 'particularly', 'want', 'to', ',', 'a', 'man', 'who', 'churns', 'out', 'budget', 'horror', 'after', 'budget', 'horror', 'to', 'less', 'than', 'enthusiastic', 'receptions', '.', 'But', 'keeps', 'on', 'doing', 'it', 'all', 'the', 'same', '.', 'Even', 'the', 'half-hearted', 'praise', 'than', 'surrounds', 'his', 'Camp', 'Blood', 'films', 'is', 'given', 'grudgingly', 'and', "I'm", 'as', 'guilty', 'of', 'this', 'as', 'anyone', '.', 'Brad', 'normally', 'manages', 'to', 'throw', 'something', 'interesting', 'into', 'the', 'mix', ',', 'a', 'neat', 'idea', ',', 'a', 'kooky', 'character', ',', 'whatever', ',', 'but', 'without', 'the', 'funds', 'to', 'take', 'it', 'further', 'than', 'base', 'level', ',', 'he', 'relies', 'on', 'the', 'audience', 'to', 'cut', 'him', 'some', 'slack', 'and', 'appreciate', 'it', 'for', 'what', 'it', 'is', 'and', 'w
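
As the sample shows, punctuation such as `,` and `.` ends up as separate tokens. The rule behind this is: split on single spaces, then peel at most one leading and one trailing punctuation mark off each word. Below is a minimal sketch of that rule on a short string. `split_punct` is a hypothetical stand-in written for illustration (it additionally skips empty strings and leaves single-character tokens untouched), not the notebook's `preprocess`.

```python
# Illustrative restatement of the splitting rule; `split_punct` is a hypothetical helper.
OPENERS = {'(', '"', "'"}
CLOSERS = {'.', ',', ';', ':', '?', '!', '"', "'", ')'}

def split_punct(text):
    tokens = []
    for word in text.split(' '):
        if not word:                      # repeated spaces produce empty strings
            continue
        if len(word) > 1 and word[0] in OPENERS:
            tokens.append(word[0])        # leading quote or parenthesis becomes its own token
            word = word[1:]
        trailing = None
        if len(word) > 1 and word[-1] in CLOSERS:
            trailing = word[-1]           # trailing punctuation becomes its own token
            word = word[:-1]
        tokens.append(word)
        if trailing is not None:
            tokens.append(trailing)
    return tokens

tokens = split_punct('A neat idea, but (sadly) underfunded.')
print(tokens)
# ['A', 'neat', 'idea', ',', 'but', '(', 'sadly', ')', 'underfunded', '.']
```

Only one mark is removed per side, which is why tokens such as `murder.<br` can survive in real reviews.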

# Step 2: Create Dataloader




## Define the Dataset Class

In the following cell, we define the <b>dataset</b> class, which holds the tokenized data for the model. It implements the following functions:

*   <b>`build_dictionary(self)`:</b> Creates the dictionaries `idx2word` and `word2idx`. Each word in the dataset is represented by a unique index, tracked in these dictionaries. The hyperparameter `threshold` controls which words appear in the dictionary: a word's frequency in the training data must be `>= threshold` for it to be included.

*   <b>`convert_text(self)`:</b> Converts each review in the dataset to a list of indices given by the `word2idx` dictionary. The result is stored in the `textual_ids` variable; the function returns nothing. A word that is not present in `word2idx` is mapped to the `<UNK>` token, and the `<END>` token is appended to the end of each review.

*   <b>`get_text(self, idx)`:</b> Returns the review at `idx` as an array of indices corresponding to the words in the review. A review shorter than `max_len` is padded with the `<PAD>` token up to length `max_len`; a longer review is truncated to its first `max_len` tokens. The return type is `torch.LongTensor`.

*   <b>`get_label(self, idx)`:</b> Returns `1` if the label for `idx` in the dataset is positive (`pos`) and `0` if it is negative (`neg`). The return type is `torch.LongTensor`.

*   <b>`__len__(self)`:</b> Returns the total number of reviews in the dataset as an `int`.

*   <b>`__getitem__(self, idx)`:</b> Returns the (padded) text and the label, both as `torch.LongTensor`, using `get_text(self, idx)` and `get_label(self, idx)`.
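
The behaviour described in the list above can be illustrated on a toy corpus. The sketch below is plain Python and uses hypothetical helper names (`build_vocab`, `encode`). It assumes the special tokens take indices 0, 1 and 2, and assigns the remaining indices in order of first occurrence, which is a simplification made for this example.

```python
from collections import Counter

PAD, END, UNK = '<PAD>', '<END>', '<UNK>'

def build_vocab(tokenized_reviews, threshold):
    """Map every word seen at least `threshold` times to a unique index."""
    counts = Counter(w.lower() for review in tokenized_reviews for w in review)
    word2idx = {PAD: 0, END: 1, UNK: 2}
    for word, count in counts.items():
        if count >= threshold:
            word2idx[word] = len(word2idx)
    return word2idx

def encode(review, word2idx, max_len):
    """Convert tokens to ids, append <END>, then truncate or pad to max_len."""
    ids = [word2idx.get(w.lower(), word2idx[UNK]) for w in review]
    ids.append(word2idx[END])
    ids = ids[:max_len]
    return ids + [word2idx[PAD]] * (max_len - len(ids))

reviews = [['Life', 'is', 'good'], ['life', 'is', 'bad']]
word2idx = build_vocab(reviews, threshold=2)
print(word2idx)                                   # 'good' and 'bad' occur once, so they are left out
print(encode(reviews[1], word2idx, max_len=6))    # [3, 4, 2, 1, 0, 0]
print(encode(reviews[1], word2idx, max_len=2))    # [3, 4]
```

With `threshold=2`, the rare words map to `<UNK>` (index 2), the `<END>` token (index 1) follows the last word, and `<PAD>` (index 0) fills the remaining positions.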

In [None]:
PAD = '<PAD>'
END = '<END>'
UNK = '<UNK>'

from torch.utils import data
from collections import defaultdict

class TextDataset(data.Dataset):
    def __init__(self, examples, split, threshold, max_len, idx2word=None, word2idx=None):        
        self.examples = examples
        assert split in {'train', 'val', 'test'}
        self.split = split
        self.threshold = threshold
        self.max_len = max_len

        # Dictionaries
        self.idx2word = idx2word
        self.word2idx = word2idx
        if split == 'train':
            self.build_dictionary()
        self.vocab_size = len(self.word2idx)
        
        # Convert text to indices
        self.textual_ids = []
        self.convert_text()

    
    def build_dictionary(self):
        '''
        Build the dictionaries idx2word and word2idx. This is only called when split='train', as these
        dictionaries are passed in to the __init__(...) function otherwise. self.threshold controls
        which words are assigned indices in the dictionaries.
        Returns nothing.
        '''
        assert self.split == 'train'
        word_counts = {}
        # Don't change this
        self.idx2word = {0:PAD, 1:END, 2: UNK}
        self.word2idx = {PAD:0, END:1, UNK: 2}
        for x in self.examples:
            for word in x[1]:
                word = word.lower()
                word_counts[word] = word_counts.get(word, 0) + 1
                # Assign the next free index the first time a word reaches the threshold
                if word_counts[word] >= self.threshold and word not in self.word2idx:
                    indx = len(self.word2idx)
                    self.idx2word[indx] = word
                    self.word2idx[word] = indx
        
    
    def convert_text(self):
        '''
        Convert each review in the dataset (self.examples) to a list of indices, given by self.word2idx.
        Store this in self.textual_ids; returns nothing.
        '''
        unk_idx = self.word2idx[UNK]
        for x in self.examples:
            # Words missing from the dictionary map to UNK; every review ends with END
            review = [self.word2idx.get(word.lower(), unk_idx) for word in x[1]]
            review.append(self.word2idx[END])
            self.textual_ids.append(review)
      

    def get_text(self, idx):
        '''
        Return the review at idx as a long tensor (torch.LongTensor) of integers corresponding to the words in the review,
        truncated or padded with PAD to exactly self.max_len entries.
        '''
        # Slicing makes a copy, so the ids stored in self.textual_ids are not modified by padding
        review = self.textual_ids[idx][:self.max_len]
        review += [self.word2idx[PAD]] * (self.max_len - len(review))
        return torch.LongTensor(review)
    
    def get_label(self, idx):
        '''
        Return 1 if the label for idx in the dataset is positive ('pos') and 0 if it is
        negative ('neg'), as a dimensionless torch.LongTensor.
        '''
        label = 0 if self.examples[idx][0] == "neg" else 1
        return torch.tensor(label, dtype=torch.long)

    def __len__(self):
        '''
        Return the number of reviews (int value) in the dataset
        '''
        return len(self.examples)
    
    def __getitem__(self, idx):
        '''
        Return the review, and label of the review specified by idx.
        '''
        return self.get_text(idx),self.get_label(idx)

In [None]:
def sanityCheckDataSet():
    # Read in the sample corpus
    reviews = [('pos', 'Your life is good when you have money, success and health'),
               ('neg', 'Life is bad when you got not a lot')]
    data = [(x[0], preprocess(x[1])) for x in reviews]
    print("Sample dataset:")
    for x in data: print(x)

    thresholds = [1,2,3]
    print('\n--- TEST: idx2word and word2idx dictionaries ---') # max_len does not matter for this test
    correct = [[',', '<END>', '<PAD>', '<UNK>', 'a', 'and', 'bad', 'good', 'got', 'have', 'health', 'is', 'life', 'lot', 'money', 'not', 'success', 'when', 'you', 'your'], ['<END>', '<PAD>', '<UNK>', 'is', 'life', 'when', 'you'], ['<END>', '<PAD>', '<UNK>']]
    for i in range(len(thresholds)):
        dataset = TextDataset(data, 'train', threshold=thresholds[i], max_len=3)

        has_passed, message = True, ''
        if has_passed and (dataset.vocab_size != len(dataset.word2idx) or dataset.vocab_size != len(dataset.idx2word)):
            has_passed, message = False, 'dataset.vocab_size (' + str(dataset.vocab_size) + ') must be the same length as dataset.word2idx (' + str(len(dataset.word2idx)) + ') and dataset.idx2word ('+str(len(dataset.idx2word)) +').'
        if has_passed and (dataset.vocab_size != len(correct[i])):
            has_passed, message = False, 'Your vocab size is incorrect. Expected: ' + str(len(correct[i])) + '\tGot: ' + str(dataset.vocab_size)
        if has_passed and sorted(list(dataset.idx2word.keys())) != list(range(0, dataset.vocab_size)):
            has_passed, message = False, 'dataset.idx2word must have keys ranging from 0 to dataset.vocab_size-1. Keys in your dataset.idx2word: ' + str(sorted(list(dataset.idx2word.keys())))
        if has_passed and sorted(list(dataset.word2idx.keys())) != correct[i]:
            has_passed, message = False, 'Your dataset.word2idx has incorrect keys. Expected: ' + str(correct[i]) + '\tGot: ' + str(sorted(list(dataset.word2idx.keys())))
        if has_passed: # Check that word2idx and idx2word are consistent
            widx = sorted(list(dataset.word2idx.items())) 
            idxw = sorted(list([(v,k) for k,v in dataset.idx2word.items()]))
            if not (len(widx) == len(idxw) and all([widx[q] == idxw[q] for q in range(len(widx))])):
                has_passed, message = False, 'Your dataset.word2idx and dataset.idx2word are not consistent. dataset.idx2word: ' + str(dataset.idx2word) + '\tdataset.word2idx: ' + str(dataset.word2idx)

        status = 'PASSED' if has_passed else 'FAILED'
        print('\tthreshold:', thresholds[i], '\tmax_len:', 3, '\t'+status, '\t'+message)
    
    print('\n--- TEST: len(dataset) ---')
    has_passed = len(dataset) == 2
    if has_passed: print('\tPASSED')
    else: print('\tlen(dataset) is incorrect. Expected: 2\tGot: ' + str(len(dataset)))

    print('\n--- TEST: __getitem__(self, idx) ---')
    max_lens = [3,8,15]
    idxes = [0,1]
    combos = [{'threshold': t, 'max_len': m, 'idx': idx} for t in thresholds for m in max_lens for idx in idxes]
    correct = [(torch.tensor([3, 4, 5]), torch.tensor(1)), (torch.tensor([ 4,  5, 15]), torch.tensor(0)), (torch.tensor([ 3,  4,  5,  6,  7,  8,  9, 10]), torch.tensor(1)), (torch.tensor([ 4,  5, 15,  7,  8, 16, 17, 18]), torch.tensor(0)), (torch.tensor([ 3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14,  1,  0,  0]), torch.tensor(1)), (torch.tensor([ 4,  5, 15,  7,  8, 16, 17, 18, 19,  1,  0,  0,  0,  0,  0]), torch.tensor(0)), (torch.tensor([2, 3, 4]), torch.tensor(1)), (torch.tensor([3, 4, 2]), torch.tensor(0)), (torch.tensor([2, 3, 4, 2, 5, 6, 2, 2]), torch.tensor(1)), (torch.tensor([3, 4, 2, 5, 6, 2, 2, 2]), torch.tensor(0)), (torch.tensor([2, 3, 4, 2, 5, 6, 2, 2, 2, 2, 2, 2, 1, 0, 0]), torch.tensor(1)), (torch.tensor([3, 4, 2, 5, 6, 2, 2, 2, 2, 1, 0, 0, 0, 0, 0]), torch.tensor(0)), (torch.tensor([2, 2, 2]), torch.tensor(1)), (torch.tensor([2, 2, 2]), torch.tensor(0)), (torch.tensor([2, 2, 2, 2, 2, 2, 2, 2]), torch.tensor(1)), (torch.tensor([2, 2, 2, 2, 2, 2, 2, 2]), torch.tensor(0)), (torch.tensor([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 0, 0]), torch.tensor(1)), (torch.tensor([2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 0, 0, 0, 0, 0]), torch.tensor(0))]
    for i in range(len(combos)):
        combo = combos[i]
        dataset = TextDataset(data, 'train', threshold=combo['threshold'], max_len=combo['max_len'])
        returned = dataset.__getitem__(combo['idx'])

        has_passed, message = True, ''
        if has_passed and len(returned) != 2:
            has_passed, message = False, 'dataset.__getitem__(idx) must return 2 things. Got ' + str(len(returned)) +' things instead.'
        if has_passed and (type(returned[0]) != torch.Tensor or type(returned[1]) != torch.Tensor):
            has_passed, message = False, 'Both returns must be of type torch.Tensor. Got: (' + str(type(returned[0])) + ', ' + str(type(returned[1])) + ')'
        if has_passed and (returned[0].shape != correct[i][0].shape):
            has_passed, message = False, 'Shape of first return is incorrect. Expected: ' + str(correct[i][0].shape) + '.\tGot: ' + str(returned[0].shape)
        if has_passed and (returned[1].shape != correct[i][1].shape):
            has_passed, message = False, 'Shape of second return is incorrect. Expected: ' + str(correct[i][1].shape) + '.\tGot: ' + str(returned[1].shape) + '\n\t\tHint: torch.Size([]) means that the tensor should be dimensionless (just a number). Try squeezing your result.'
        if has_passed and (returned[1] != correct[i][1]):
            has_passed, message = False, 'Label (second return) is incorrect. Expected: ' + str(correct[i][1]) + '.\tGot: ' + str(returned[1])
        if has_passed:
            correct_padding_idxes, your_padding_idxes = torch.where(correct[i][0] == 0)[0], torch.where(returned[0] == dataset.word2idx[PAD])[0]
            if not (correct_padding_idxes.shape == your_padding_idxes.shape and torch.all(correct_padding_idxes == your_padding_idxes)):
                has_passed, message = False, 'Padding is not correct. Expected padding indexes: ' + str(correct_padding_idxes) + '.\tYour padding indexes: ' + str(your_padding_idxes)

        status = 'PASSED' if has_passed else 'FAILED'
        print('\tthreshold:', combo['threshold'], '\tmax_len:', combo['max_len'] , '\tidx:', combo['idx'], '\t'+status, '\t'+message)

if __name__ == '__main__':
    sanityCheckDataSet()

Sample dataset:
('pos', ['Your', 'life', 'is', 'good', 'when', 'you', 'have', 'money', ',', 'success', 'and', 'health'])
('neg', ['Life', 'is', 'bad', 'when', 'you', 'got', 'not', 'a', 'lot'])

--- TEST: idx2word and word2idx dictionaries ---
	threshold: 1 	max_len: 3 	PASSED 	
	threshold: 2 	max_len: 3 	PASSED 	
	threshold: 3 	max_len: 3 	PASSED 	

--- TEST: len(dataset) ---
	PASSED

--- TEST: __getitem__(self, idx) ---
	threshold: 1 	max_len: 3 	idx: 0 	PASSED 	
	threshold: 1 	max_len: 3 	idx: 1 	PASSED 	
	threshold: 1 	max_len: 8 	idx: 0 	PASSED 	
	threshold: 1 	max_len: 8 	idx: 1 	PASSED 	
	threshold: 1 	max_len: 15 	idx: 0 	PASSED 	
	threshold: 1 	max_len: 15 	idx: 1 	PASSED 	
	threshold: 2 	max_len: 3 	idx: 0 	PASSED 	
	threshold: 2 	max_len: 3 	idx: 1 	PASSED 	
	threshold: 2 	max_len: 8 	idx: 0 	PASSED 	
	threshold: 2 	max_len: 8 	idx: 1 	PASSED 	
	threshold: 2 	max_len: 15 	idx: 0 	PASSED 	
	threshold: 2 	max_len: 15 	idx: 1 	PASSED 	
	threshold: 3 	max_len: 3 	idx: 0 	PASSED 	

The following cell builds the dataset on the IMDb movie reviews and prints an example:

In [None]:
if __name__=='__main__':
    train_dataset = TextDataset(train_data, 'train', threshold=10, max_len=150)
    print('Vocab size:', train_dataset.vocab_size, '\n')

    randidx = random.randint(0, len(train_dataset)-1)
    text, label = train_dataset[randidx]
    print('Example text:')
    print(train_data[randidx][1])
    print(text)
    print('\nExample label:')
    print(train_data[randidx][0])
    print(label)

Vocab size: 19002 

Example text:
['Ronald', 'Colman', 'plays', 'a', 'famous', 'Broadway', 'actor', 'who', 'has', 'begun', 'to', 'lose', 'his', 'mind', 'and', 'sense', 'of', 'identity', '.', 'After', 'years', 'of', 'playing', 'a', 'wide', 'range', 'of', 'parts', ',', 'he', "can't", 'remember', 'who', 'he', 'exactly', 'is--who', 'are', 'his', 'roles', 'and', 'who', 'is', 'the', 'self', '.', 'And', ',', 'much', 'more', 'serious', ',', 'he', 'begins', 'to', 'see', 'and', 'hear', 'his', 'play', 'even', 'in', 'regular', 'everyday', 'life', '.', 'So', ',', 'since', "he's", 'currently', 'playing', 'in', '"', 'Othello"', ',', 'he', 'begins', 'to', 'act', 'jealous', 'and', 'suspicious--just', 'like', 'the', 'title', 'character', '.', 'Ultimately', ',', 'it', 'leads', 'him', 'to', 'the', 'depths', 'of', 'insanity', 'and', 'murder.<br', '/><br', '/>I', 'saw', 'this', 'film', 'years', 'ago', 'and', 'liked', 'it', '.', 'I', 'just', 'saw', 'it', 'again', 'and', 'loved', 'it', '.', 'Now', 'perhaps', 
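
Because `__getitem__` returns fixed-length tensors, a dataset of this kind can be batched with `torch.utils.data.DataLoader` without a custom collate function. The following is a minimal sketch of that step. It uses a stand-in `TensorDataset` filled with made-up ids rather than the IMDb data, and the batch size of 4 is an arbitrary choice for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data: 10 "reviews" of max_len=5 word ids and binary labels (made-up values).
texts = torch.randint(low=0, high=100, size=(10, 5), dtype=torch.long)
labels = torch.randint(low=0, high=2, size=(10,), dtype=torch.long)
toy_dataset = TensorDataset(texts, labels)

# shuffle=True reorders the examples every epoch; the last batch may be smaller.
loader = DataLoader(toy_dataset, batch_size=4, shuffle=True)

batch_shapes = [(t.shape, l.shape) for t, l in loader]
for text_shape, label_shape in batch_shapes:
    print(text_shape, label_shape)
```

Each batch has shape `[batch_size, max_len]` for the texts and `[batch_size]` for the labels, which is the input layout a model's `forward` method would receive.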

# Step 3: Train a Convolutional Neural Network (CNN)

## Define the CNN Model 
Here we define a convolutional neural network for text classification.
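
For intuition about the tensor shapes involved: a filter of height `h` that slides over `max_len` word embeddings with stride `s` and no padding visits `floor((max_len - h) / s) + 1` positions. Max-pooling over those positions then leaves one value per output channel. A small sketch of that arithmetic follows; the function name `conv_positions` is hypothetical.

```python
def conv_positions(max_len, filter_height, stride=1):
    """Number of window positions of an unpadded convolution along the sequence axis."""
    if filter_height > max_len:
        raise ValueError('filter is taller than the sequence')
    return (max_len - filter_height) // stride + 1

max_len = 150  # the max_len used when building the IMDb dataset above
for h in (3, 4, 5):
    print(h, conv_positions(max_len, h, stride=1), conv_positions(max_len, h, stride=3))
# 3 148 50
# 4 147 49
# 5 146 49
```

Whatever the number of positions, the pooled output has one entry per channel, so filters of different heights can be concatenated into a single feature vector.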

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class CNN(nn.Module):
    def __init__(self, vocab_size, embed_size, out_channels, filter_heights, stride, dropout, num_classes, pad_idx):
        super(CNN, self).__init__()

        # Embedding layer (https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html)
        #   that represents the words in the vocabulary; pad_idx marks the padding token.
        self.embedding = nn.Embedding(vocab_size, embed_size, padding_idx=pad_idx)

        # One convolution layer (nn.Conv2d) per filter height, each with kernel size [filter_height, embed_size].
        #   Input channels is 1 and output channels is out_channels (this many filters are trained per layer).
        # Note: even though the layers are nn.Conv2d, this is effectively a 1d convolution, since the kernel spans
        #   the full embedding width and therefore only moves along the sequence dimension.
        self.convs = nn.ModuleList([
            nn.Conv2d(1, out_channels, (filter_ht, embed_size), stride=stride)
            for filter_ht in filter_heights
        ])

        # Dropout layer
        self.dropout = nn.Dropout(dropout)

        # Linear layer with num_classes units that takes the concatenated output of all conv layers
        #   (out_channels * number of conv layers units) as input
        self.fc1 = nn.Linear(len(filter_heights) * out_channels, num_classes)

    def forward(self, texts):
        """
        texts: LongTensor [batch_size, max_len]

        Returns output: Tensor [batch_size, num_classes]
        """
        # Word ids -> word embeddings. Shape: [batch_size, max_len, embed_size]
        word_embeddings = self.embedding(texts)

        # Conv2d expects a channel dimension. Shape: [batch_size, 1, max_len, embed_size]
        word_embeddings = word_embeddings.unsqueeze(1)

        # Each conv output has shape [batch_size, out_channels, *, 1], where * depends on filter_height and stride.
        #   Apply the non-linearity (F.relu is a common choice) and squeeze to [batch_size, out_channels, *].
        #   A filter_height of 3 means the layer looks at 3 words (3-grams) at a time, and each layer learns
        #   out_channels features from the words it sees.
        conv_outputs = [F.relu(conv(word_embeddings)).squeeze(3) for conv in self.convs]

        # Max over the sequence dimension keeps the strongest n-gram response per channel.
        #   Shape: [batch_size, out_channels]
        pooled = [F.max_pool1d(out, out.size(2)).squeeze(2) for out in conv_outputs]

        # Concatenate the outputs of all conv layers. Shape: [batch_size, out_channels * number of conv layers]
        x = torch.cat(pooled, 1)

        # Apply dropout, then the linear layer. Shape: [batch_size, num_classes]
        x = self.dropout(x)
        logit = self.fc1(x)

        ##### NOTE: Do not apply a sigmoid or softmax to the final output - done in training method!
        return logit
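
The number of trainable parameters of such a model can be worked out by hand, which is a useful check on an implementation. The sketch below assumes one embedding table, one `nn.Conv2d` (with bias) per filter height, and one linear layer (with bias); the function name `cnn_param_count` is hypothetical. Stride and dropout do not add parameters.

```python
def cnn_param_count(vocab_size, embed_size, out_channels, filter_heights, num_classes):
    # Embedding table: one embed_size vector per vocabulary entry
    embedding = vocab_size * embed_size
    # Each conv layer: out_channels kernels of size [h, embed_size] on 1 input channel, plus biases
    convs = sum(out_channels * h * embed_size + out_channels for h in filter_heights)
    # Linear layer: weights over the concatenated pooled features, plus biases
    linear = len(filter_heights) * out_channels * num_classes + num_classes
    return embedding + convs + linear

print(cnn_param_count(1000, 16, 32, [3, 4, 5], 2))   # 22434
print(cnn_param_count(1000, 16, 32, [5, 10], 3))     # 23939
```

Both values agree with the corresponding expected counts in the sanity-check cell that follows.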

In [None]:
count_parameters = lambda model: sum(p.numel() for p in model.parameters() if p.requires_grad)

def sanityCheckModel(all_test_params, NN, expected_outputs, init_or_forward, data_loader):
    print('--- TEST: ' + ('Number of Model Parameters (tests __init__(...))' if init_or_forward=='init' else 'Output shape of forward(...)') + ' ---')
    
    if init_or_forward == "forward":
        # Reading the first batch of data for testing
        for texts_, labels_ in data_loader:
            texts_batch, labels_batch = texts_, labels_
            break

    for tp_idx, (test_params, expected_output) in enumerate(zip(all_test_params, expected_outputs)):       
        if init_or_forward == "forward":
            batch_size = test_params['batch_size']
            texts = texts_batch[:batch_size]

        # Construct the student model
        tps = {k:v for k, v in test_params.items() if k != 'batch_size'}
        stu_nn = NN(**tps)

        if init_or_forward == "forward":
            with torch.no_grad(): 
                stu_out = stu_nn(texts)
            ref_out_shape = expected_output

            has_passed = torch.is_tensor(stu_out)
            if not has_passed: msg = 'Output must be a torch.Tensor; received ' + str(type(stu_out))
            else: 
                has_passed = stu_out.shape == ref_out_shape
                msg = 'Your Output Shape: ' + str(stu_out.shape)
            

            status = 'PASSED' if has_passed else 'FAILED'
            message = '\t' + status + "\t Init Input: " + str({k:v for k,v in tps.items()}) + '\tForward Input Shape: ' + str(texts.shape) + '\tExpected Output Shape: ' + str(ref_out_shape) + '\t' + msg
            print(message)
        else:
            stu_num_params = count_parameters(stu_nn)
            ref_num_params = expected_output
            comparison_result = (stu_num_params == ref_num_params)

            status = 'PASSED' if comparison_result else 'FAILED'
            message = '\t' + status + "\tInput: " + str({k:v for k,v in test_params.items()}) + ('\tExpected Num. Params: ' + str(ref_num_params) + '\tYour Num. Params: '+ str(stu_num_params))
            print(message)

        del stu_nn


if __name__ == '__main__':
    # Test init
    # All 32 hyperparameter combinations, in the same order as expected_outputs (num_classes varies fastest)
    from itertools import product
    inputs = [{'vocab_size': 1000, 'embed_size': e, 'out_channels': o, 'filter_heights': fh, 'stride': s, 'dropout': 0, 'num_classes': n, 'pad_idx': 0}
              for e, o, fh, s, n in product([16, 32], [32, 128], [[3, 4, 5], [5, 10]], [1, 3], [2, 3])]
    expected_outputs = [22434, 22531, 22434, 22531, 23874, 23939, 23874, 23939, 41730, 42115, 41730, 42115, 47490, 47747, 47490, 47747, 44578, 44675, 44578, 44675, 47554, 47619, 47554, 47619, 82306, 82691, 82306, 82691, 94210, 94467, 94210, 94467]

    sanityCheckModel(inputs, CNN, expected_outputs, "init", None)
    print()

    # Test forward
    # All 32 hyperparameter combinations (batch_size varies fastest); the expected output shape is [batch_size, num_classes]
    from itertools import product
    inputs = [{'vocab_size': 29730, 'embed_size': e, 'out_channels': o, 'filter_heights': fh, 'stride': s, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0, 'batch_size': b}
              for e, o, fh, s, b in product([16, 32], [32, 128], [[3, 4, 5], [5, 10]], [1, 3], [1, 20])]
    expected_outputs = [torch.Size([p['batch_size'], p['num_classes']]) for p in inputs]
    sanity_dataset = TextDataset(train_data, 'train', 5, 150)
    sanity_loader = torch.utils.data.DataLoader(sanity_dataset, batch_size=50, shuffle=True, num_workers=2, drop_last=True)

    sanityCheckModel(inputs, CNN, expected_outputs, "forward", sanity_loader)

--- TEST: Number of Model Parameters (tests __init__(...)) ---
	PASSED	Input: {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 32, 'filter_heights': [3, 4, 5], 'stride': 1, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0}	Expected Num. Params: 22434	Your Num. Params: 22434
	PASSED	Input: {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 32, 'filter_heights': [3, 4, 5], 'stride': 1, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0}	Expected Num. Params: 22531	Your Num. Params: 22531
	PASSED	Input: {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 32, 'filter_heights': [3, 4, 5], 'stride': 3, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0}	Expected Num. Params: 22434	Your Num. Params: 22434
	PASSED	Input: {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 32, 'filter_heights': [3, 4, 5], 'stride': 3, 'dropout': 0, 'num_classes': 3, 'pad_idx': 0}	Expected Num. Params: 22531	Your Num. Params: 22531
	PASSED	Input: {'vocab_size': 1000, 'embed_size': 16, 'out_channels': 32, 'filter
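
The expected parameter counts above can be reproduced by hand. The sketch below is only an illustration: `cnn_param_count` is a hypothetical helper, and it assumes the CNN consists of an embedding table, one 2-D convolution per filter height (single input channel, kernel spanning the full embedding width), and a linear layer over the concatenated pooled features. That layout is not shown in this section, but it is consistent with the numbers reported.

```python
def cnn_param_count(vocab_size, embed_size, out_channels, filter_heights, num_classes):
    """Count trainable parameters for the assumed CNN layout (hypothetical helper)."""
    # Embedding table: one vector of size embed_size per vocabulary entry.
    embedding = vocab_size * embed_size
    # One convolution per filter height: out_channels kernels of shape
    # (1, height, embed_size) plus one bias per output channel.
    convs = sum(out_channels * height * embed_size + out_channels
                for height in filter_heights)
    # Linear layer over the concatenated pooled features, plus one bias per class.
    linear = len(filter_heights) * out_channels * num_classes + num_classes
    return embedding + convs + linear


print(cnn_param_count(1000, 16, 32, [3, 4, 5], 2))  # 22434
print(cnn_param_count(1000, 16, 32, [3, 4, 5], 3))  # 22531
```

The stride does not appear in the formula, which matches the identical counts for `stride` 1 and 3. With `vocab_size=29730` (the value used in the sanity-check inputs), `embed_size=128`, `out_channels=64`, and `filter_heights=[2, 3, 4]`, the same formula gives 3,879,746, which agrees with the count printed for the full model later in this section.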

## Train CNN Model

First, we initialize the train and test **dataloaders**. A dataloader is responsible for providing batches of data to your model. Notice how we first instantiate datasets for the train and test data, and that we use the training vocabulary for both.
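
As a minimal illustration of what a dataloader does, the sketch below batches a toy dataset. `TensorDataset` is only a stand-in for `TextDataset`, and all sizes are made up for the example.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for TextDataset: 10 "reviews" of 6 token ids each, with binary labels.
toy_texts = torch.randint(0, 100, (10, 6))
toy_labels = torch.randint(0, 2, (10,))
toy_dataset = TensorDataset(toy_texts, toy_labels)

# batch_size=4 with drop_last=True yields two full batches;
# the 2 leftover examples are dropped.
toy_loader = DataLoader(toy_dataset, batch_size=4, shuffle=True, drop_last=True)

batch_shapes = []
for texts, labels in toy_loader:
    batch_shapes.append((texts.shape, labels.shape))
    print(texts.shape, labels.shape)  # torch.Size([4, 6]) torch.Size([4])
```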

In [None]:
if __name__=='__main__':
    THRESHOLD = 5 
    MAX_LEN = 200 
    BATCH_SIZE = 32 
    train_dataset = TextDataset(train_data, 'train', THRESHOLD, MAX_LEN)
    train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=2, drop_last=True)

    test_dataset = TextDataset(test_data, 'test', THRESHOLD, MAX_LEN, train_dataset.idx2word, train_dataset.word2idx)
    test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=1, shuffle=False, num_workers=1, drop_last=False)
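
The reason for passing the training vocabulary to the test dataset can be sketched in plain Python. Everything here is hypothetical: `build_vocab` and `encode` are illustrative helpers, not the `TextDataset` implementation, and the rule "keep words that occur at least `threshold` times" is an assumption about how `THRESHOLD` is applied.

```python
from collections import Counter

PAD, UNK = '<PAD>', '<UNK>'  # special tokens, assumed for this sketch


def build_vocab(token_lists, threshold):
    """Keep words seen at least `threshold` times in the training data."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    idx2word = [PAD, UNK] + sorted(w for w, c in counts.items() if c >= threshold)
    word2idx = {w: i for i, w in enumerate(idx2word)}
    return idx2word, word2idx


def encode(tokens, word2idx):
    """Map tokens to ids; anything outside the vocabulary becomes UNK."""
    return [word2idx.get(tok, word2idx[UNK]) for tok in tokens]


train_tokens = [['good', 'movie'], ['bad', 'movie'], ['good', 'plot']]
idx2word, word2idx = build_vocab(train_tokens, threshold=2)
print(idx2word)  # ['<PAD>', '<UNK>', 'good', 'movie']

# A test review is encoded with the *training* mapping, so ids mean the same
# thing at train and test time; the unseen word 'acting' maps to UNK.
test_ids = encode(['good', 'acting'], word2idx)
print(test_ids)  # [2, 1]
```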

In [None]:
from tqdm.notebook import tqdm

def train_model(model, num_epochs, data_loader, optimizer, criterion):
    print('Training Model...')
    model.train()
    for epoch in tqdm(range(num_epochs)):
        epoch_loss = 0
        epoch_acc = 0
        for texts, labels in data_loader:
            texts = texts.to(DEVICE) # shape: [batch_size, MAX_LEN]
            labels = labels.to(DEVICE) # shape: [batch_size]

            optimizer.zero_grad()

            output = model(texts)
            acc = accuracy(output, labels)
            
            loss = criterion(output, labels)
            loss.backward()
            optimizer.step()

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        print('[TRAIN]\t Epoch: {:2d}\t Loss: {:.4f}\t Train Accuracy: {:.2f}%'.format(epoch+1, epoch_loss/len(data_loader), 100*epoch_acc/len(data_loader)))
    print('Model Trained!\n')
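
The core of `train_model` is the four-step update inside the inner loop. The sketch below repeats those steps on a toy problem so they can be run in isolation; the linear model and the random data are assumptions made for the example and merely stand in for the CNN and the IMDb batches.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Toy data: 32 examples with 4 features; the label depends on the sign of the first feature.
features = torch.randn(32, 4)
labels = (features[:, 0] > 0).long()

toy_model = nn.Linear(4, 2)  # hypothetical stand-in for the text classifier
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(toy_model.parameters(), lr=0.05)

losses = []
for step in range(50):
    optimizer.zero_grad()                           # clear gradients from the previous step
    loss = criterion(toy_model(features), labels)   # forward pass and loss
    loss.backward()                                 # backpropagate
    optimizer.step()                                # update the parameters
    losses.append(loss.item())

print('first loss: {:.4f}, last loss: {:.4f}'.format(losses[0], losses[-1]))
```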

In [None]:
def accuracy(output, labels):
    """
    Returns accuracy per batch
    output: Tensor [batch_size, n_classes]
    labels: LongTensor [batch_size]
    """
    preds = output.argmax(dim=1) # find predicted class
    correct = (preds == labels).sum().float() # convert into float for division 
    acc = correct / len(labels)
    return acc
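
A quick worked example of the same computation on a hand-made batch (the logits and labels are invented for illustration):

```python
import torch

# Four examples, two classes. Each row holds the raw scores (logits) for one example.
output = torch.tensor([[2.0, 0.5],
                       [0.1, 1.2],
                       [0.3, 0.9],
                       [1.5, 1.0]])
labels = torch.tensor([0, 1, 0, 0])

preds = output.argmax(dim=1)                 # predicted classes: 0, 1, 1, 0
correct = (preds == labels).sum().float()    # three of the four predictions match
batch_acc = correct / len(labels)
print(batch_acc.item())  # 0.75
```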

In [None]:
if __name__=='__main__':
    cnn_model = CNN(vocab_size = train_dataset.vocab_size, # Don't change this
                embed_size = 128, 
                out_channels = 64, 
                filter_heights = [2, 3, 4], 
                stride = 1, 
                dropout = 0.5, 
                num_classes = 2, 
                pad_idx = train_dataset.word2idx[PAD]) 

    # Put the model on the device (cuda or cpu)
    cnn_model = cnn_model.to(DEVICE)
    
    print('The model has {:,d} trainable parameters'.format(count_parameters(cnn_model)))

The model has 3,879,746 trainable parameters


In [None]:
import torch.optim as optim

if __name__=='__main__':    
    LEARNING_RATE = 5e-4 # learning rate

    # Define the loss function
    criterion = nn.CrossEntropyLoss().to(DEVICE)

    # Define the optimizer
    optimizer = optim.Adam(cnn_model.parameters(), lr=LEARNING_RATE)
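
`nn.CrossEntropyLoss` expects raw, unnormalized scores, which is why the models in this notebook return the output of their linear layer directly. The short check below, using made-up logits, shows that the loss equals the mean negative log-softmax score of the correct class:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.0],
                       [0.0, 1.0]])
labels = torch.tensor([0, 1])

# Built-in loss on raw logits.
loss = nn.CrossEntropyLoss()(logits, labels)

# Same quantity computed by hand: log-softmax, pick the correct class, average.
log_probs = F.log_softmax(logits, dim=1)
manual = -log_probs[torch.arange(len(labels)), labels].mean()

print(loss.item(), manual.item())
```

Applying a softmax inside the model as well would normalize the scores twice and distort the loss.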

Finally, we can train the model. 

In [None]:
if __name__=='__main__':    
    N_EPOCHS = 10 
    
    # train model for N_EPOCHS epochs
    train_model(cnn_model, N_EPOCHS, train_loader, optimizer, criterion)

Training Model...


  0%|          | 0/10 [00:00<?, ?it/s]

[TRAIN]	 Epoch:  1	 Loss: 0.6923	 Train Accuracy: 60.13%
[TRAIN]	 Epoch:  2	 Loss: 0.5603	 Train Accuracy: 71.09%
[TRAIN]	 Epoch:  3	 Loss: 0.4970	 Train Accuracy: 75.81%
[TRAIN]	 Epoch:  4	 Loss: 0.4513	 Train Accuracy: 78.62%
[TRAIN]	 Epoch:  5	 Loss: 0.4193	 Train Accuracy: 80.74%
[TRAIN]	 Epoch:  6	 Loss: 0.3761	 Train Accuracy: 83.27%
[TRAIN]	 Epoch:  7	 Loss: 0.3420	 Train Accuracy: 85.22%
[TRAIN]	 Epoch:  8	 Loss: 0.3037	 Train Accuracy: 87.21%
[TRAIN]	 Epoch:  9	 Loss: 0.2575	 Train Accuracy: 89.10%
[TRAIN]	 Epoch: 10	 Loss: 0.2246	 Train Accuracy: 90.99%
Model Trained!



## Evaluate CNN Model

Now that we have trained a model for text classification, it is time to evaluate it.

In [None]:
import random

@torch.no_grad()  # gradients are not needed during evaluation
def evaluate(model, data_loader, criterion, use_tqdm=False):
    print('Evaluating performance on the test dataset...')
    model.eval()
    epoch_loss = 0
    epoch_acc = 0
    all_predictions = []
    print("\nSOME PREDICTIONS FROM THE MODEL:")
    iterator = tqdm(data_loader) if use_tqdm else data_loader
    total = 0
    for texts, labels in iterator:
        bs = texts.shape[0]
        total += bs
        texts = texts.to(DEVICE)
        labels = labels.to(DEVICE)
        
        output = model(texts)
        acc = accuracy(output, labels) * len(labels)
        pred = output.argmax(dim=1)
        all_predictions.append(pred)
        
        loss = criterion(output, labels) * len(labels)
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()

        # Occasionally print a sample prediction (only when evaluating one review at a time)
        if random.random() < 0.0015 and bs == 1:
            dataset = data_loader.dataset
            skip = {dataset.word2idx[PAD], dataset.word2idx[END]}
            words = [dataset.idx2word[idx] for idx in texts[0].tolist() if idx not in skip]
            print("Input: " + ' '.join(words))
            print("Prediction:", pred.item(), '\tCorrect Output:', labels.item(), '\n')

    full_acc = 100*epoch_acc/total
    full_loss = epoch_loss/total
    print('[TEST]\t Loss: {:.4f}\t Accuracy: {:.2f}%'.format(full_loss, full_acc))
    predictions = torch.cat(all_predictions)
    return predictions, full_acc, full_loss
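
`evaluate` multiplies each batch's accuracy and loss by the batch size and divides by the total number of examples. The numbers below are invented, but they show why this matters when the last batch is smaller than the others:

```python
# (batch_size, mean accuracy within the batch) for two hypothetical batches
batches = [(4, 0.75), (1, 0.0)]

# Averaging the per-batch means treats the single-example batch like a full one.
unweighted = sum(acc for _, acc in batches) / len(batches)

# Weighting by batch size recovers the true fraction of correct examples (3 of 5).
weighted = sum(n * acc for n, acc in batches) / sum(n for n, _ in batches)

print(unweighted)  # 0.375
print(weighted)    # 0.6
```

With `batch_size=1`, as in the test loader here, both averages coincide; the weighting only matters for larger batches with `drop_last=False`.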

In [None]:
if __name__=='__main__':
    evaluate(cnn_model, test_loader, criterion, use_tqdm=True) # Compute test data accuracy

Evaluating performance on the test dataset...

SOME PREDICTIONS FROM THE MODEL:


  0%|          | 0/5000 [00:00<?, ?it/s]

Input: the omega code was a model of <UNK> inconsistency . there was a bit ( but precious little ) of good acting , primarily by the two <UNK> and <UNK> , who only appeared once and had no lines . otherwise the acting was decidedly bad . the plot line was rather weak , and only partially based on already questionable biblical interpretation . certainly not one of the year's best .
Prediction: 0 	Correct Output: 0 

Input: this sure is one comedy i'm not likely to forget for a while.<br /><br <UNK> normally bother to comment on this movie : it's so minor that no one would watch it anyway , but as it happens , it's kind of popular in <UNK> sharing networks such as <UNK> , and so this <UNK> production needs to be exposed for what it is.<br /><br />so what is it then ? well , of course it's not really a comedy ; instead , it's intended as a horror flick -- " intended " very much being the key word here . the script is a totally incoherent and unbalanced mess , the special effects are only 

# Step 4: Train a Recurrent Neural Network (RNN)
We will now build a text classification model that is based on **recurrences**.

## Define the RNN Model

First, we will define the RNN.

In [None]:
class RNN(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers, bidirectional, dropout, num_classes, pad_idx):
        super(RNN, self).__init__()
        self.bidirectional = bidirectional
        self.hidden_size = hidden_size
        self.num_layers = num_layers

        # Embedding layer (https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html)
        #   that maps word ids to vectors; the row at pad_idx is kept at zero.
        self.embedding = nn.Embedding(vocab_size, embed_size, padding_idx=pad_idx)

        # Recurrent network (nn.GRU, not nn.LSTM) with batch_first=True
        self.gru = nn.GRU(embed_size, hidden_size, num_layers, dropout=dropout, batch_first=True, bidirectional=bidirectional)

        # Dropout layer applied to the final hidden state
        self.dropout = nn.Dropout(dropout)

        # Linear layer with num_classes units that takes the final hidden state as input.
        #   In the bidirectional case the final forward and backward states are
        #   concatenated, so the input size doubles.
        num_dirs = 2 if bidirectional else 1
        self.fc = nn.Linear(num_dirs * hidden_size, num_classes)

    def forward(self, texts):
        """
        texts: LongTensor [batch_size, MAX_LEN]
        
        Returns output: Tensor [batch_size, num_classes]
        """
        # Convert word ids to word embeddings
        #   Resulting shape: [batch_size, max_len, embed_size]
        embedded = self.embedding(texts)

        # Run the GRU; the initial hidden state defaults to zeros.
        #   hidden shape: [num_layers*num_dirs, batch_size, hidden_size]
        _, hidden = self.gru(embedded)

        # Take the final hidden state of the top layer. For a bidirectional GRU,
        #   hidden[-2] is the forward direction and hidden[-1] the backward direction.
        #   Resulting shape: [batch_size, num_dirs*hidden_size]
        if self.bidirectional:
            hidden = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)
        else:
            hidden = hidden[-1, :, :]

        # Apply dropout
        hidden = self.dropout(hidden)

        # Pass the result through the linear layer
        #   Resulting shape: [batch_size, num_classes]
        # NOTE: no sigmoid or softmax here; nn.CrossEntropyLoss applies log-softmax internally.
        return self.fc(hidden)
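
To see which slices of the GRU's return values hold the "last timestep" of each direction, the sketch below runs a small bidirectional `nn.GRU` on random input. The sizes are arbitrary and chosen only for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, seq_len, embed_size, hidden_size = 3, 7, 16, 32

gru = nn.GRU(embed_size, hidden_size, num_layers=2, batch_first=True, bidirectional=True)
embedded = torch.randn(batch_size, seq_len, embed_size)
output, hidden = gru(embedded)

print(output.shape)  # torch.Size([3, 7, 64]) -> [batch, seq_len, num_dirs*hidden_size]
print(hidden.shape)  # torch.Size([4, 3, 32]) -> [num_layers*num_dirs, batch, hidden_size]

# The top-layer forward state equals the forward half of the output at the last timestep.
forward_matches = torch.allclose(hidden[-2], output[:, -1, :hidden_size], atol=1e-6)
# The top-layer backward state equals the backward half of the output at the first timestep.
backward_matches = torch.allclose(hidden[-1], output[:, 0, hidden_size:], atol=1e-6)
print(forward_matches, backward_matches)  # True True
```

This is why the final states are read from `hidden` rather than from `output[:, -1, :]`: at the last timestep the backward direction has processed only a single token.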

In [None]:
if __name__ == '__main__':
    # Test init
    inputs = [{'vocab_size': 1000, 'embed_size': 16, 'hidden_size': 32, 'num_layers': 2, 'bidirectional': True, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 16, 'hidden_size': 32, 'num_layers': 2, 'bidirectional': True, 'dropout': 0, 'num_classes': 4, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 16, 'hidden_size': 32, 'num_layers': 2, 'bidirectional': False, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 16, 'hidden_size': 32, 'num_layers': 2, 'bidirectional': False, 'dropout': 0, 'num_classes': 4, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 16, 'hidden_size': 32, 'num_layers': 4, 'bidirectional': True, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 16, 'hidden_size': 32, 'num_layers': 4, 'bidirectional': True, 'dropout': 0, 'num_classes': 4, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 16, 'hidden_size': 32, 'num_layers': 4, 'bidirectional': False, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 16, 'hidden_size': 32, 'num_layers': 4, 'bidirectional': False, 'dropout': 0, 'num_classes': 4, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 16, 'hidden_size': 256, 'num_layers': 2, 'bidirectional': True, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 16, 'hidden_size': 256, 'num_layers': 2, 'bidirectional': True, 'dropout': 0, 'num_classes': 4, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 16, 'hidden_size': 256, 'num_layers': 2, 'bidirectional': False, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 16, 'hidden_size': 256, 'num_layers': 2, 'bidirectional': False, 'dropout': 0, 'num_classes': 4, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 16, 'hidden_size': 256, 'num_layers': 4, 'bidirectional': True, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 16, 'hidden_size': 256, 'num_layers': 4, 
'bidirectional': True, 'dropout': 0, 'num_classes': 4, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 16, 'hidden_size': 256, 'num_layers': 4, 'bidirectional': False, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 16, 'hidden_size': 256, 'num_layers': 4, 'bidirectional': False, 'dropout': 0, 'num_classes': 4, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 64, 'hidden_size': 32, 'num_layers': 2, 'bidirectional': True, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 64, 'hidden_size': 32, 'num_layers': 2, 'bidirectional': True, 'dropout': 0, 'num_classes': 4, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 64, 'hidden_size': 32, 'num_layers': 2, 'bidirectional': False, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 64, 'hidden_size': 32, 'num_layers': 2, 'bidirectional': False, 'dropout': 0, 'num_classes': 4, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 64, 'hidden_size': 32, 'num_layers': 4, 'bidirectional': True, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 64, 'hidden_size': 32, 'num_layers': 4, 'bidirectional': True, 'dropout': 0, 'num_classes': 4, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 64, 'hidden_size': 32, 'num_layers': 4, 'bidirectional': False, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 64, 'hidden_size': 32, 'num_layers': 4, 'bidirectional': False, 'dropout': 0, 'num_classes': 4, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 64, 'hidden_size': 256, 'num_layers': 2, 'bidirectional': True, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 64, 'hidden_size': 256, 'num_layers': 2, 'bidirectional': True, 'dropout': 0, 'num_classes': 4, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 64, 'hidden_size': 256, 'num_layers': 2, 'bidirectional': False, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0}, {'vocab_size': 1000, 
'embed_size': 64, 'hidden_size': 256, 'num_layers': 2, 'bidirectional': False, 'dropout': 0, 'num_classes': 4, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 64, 'hidden_size': 256, 'num_layers': 4, 'bidirectional': True, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 64, 'hidden_size': 256, 'num_layers': 4, 'bidirectional': True, 'dropout': 0, 'num_classes': 4, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 64, 'hidden_size': 256, 'num_layers': 4, 'bidirectional': False, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0}, {'vocab_size': 1000, 'embed_size': 64, 'hidden_size': 256, 'num_layers': 4, 'bidirectional': False, 'dropout': 0, 'num_classes': 4, 'pad_idx': 0}]
    expected_outputs = [44546, 44676, 27202, 27268, 82178, 82308, 39874, 39940, 1620610, 1621636, 621698, 622212, 3986050, 3987076, 1411202, 1411716, 101762, 101892, 79810, 79876, 139394, 139524, 92482, 92548, 1742338, 1743364, 706562, 707076, 4107778, 4108804, 1496066, 1496580]

    sanityCheckModel(inputs, RNN, expected_outputs, "init", None)
    print()

    # Test forward
    inputs = [{'vocab_size': 29730, 'embed_size': 16, 'hidden_size': 32, 'num_layers': 2, 'bidirectional': True, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0, 'batch_size': 1}, {'vocab_size': 29730, 'embed_size': 16, 'hidden_size': 32, 'num_layers': 2, 'bidirectional': True, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0, 'batch_size': 2}, {'vocab_size': 29730, 'embed_size': 16, 'hidden_size': 32, 'num_layers': 2, 'bidirectional': True, 'dropout': 0, 'num_classes': 4, 'pad_idx': 0, 'batch_size': 1}, {'vocab_size': 29730, 'embed_size': 16, 'hidden_size': 32, 'num_layers': 2, 'bidirectional': True, 'dropout': 0, 'num_classes': 4, 'pad_idx': 0, 'batch_size': 2}, {'vocab_size': 29730, 'embed_size': 16, 'hidden_size': 32, 'num_layers': 2, 'bidirectional': False, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0, 'batch_size': 1}, {'vocab_size': 29730, 'embed_size': 16, 'hidden_size': 32, 'num_layers': 2, 'bidirectional': False, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0, 'batch_size': 2}, {'vocab_size': 29730, 'embed_size': 16, 'hidden_size': 32, 'num_layers': 2, 'bidirectional': False, 'dropout': 0, 'num_classes': 4, 'pad_idx': 0, 'batch_size': 1}, {'vocab_size': 29730, 'embed_size': 16, 'hidden_size': 32, 'num_layers': 2, 'bidirectional': False, 'dropout': 0, 'num_classes': 4, 'pad_idx': 0, 'batch_size': 2}, {'vocab_size': 29730, 'embed_size': 16, 'hidden_size': 32, 'num_layers': 4, 'bidirectional': True, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0, 'batch_size': 1}, {'vocab_size': 29730, 'embed_size': 16, 'hidden_size': 32, 'num_layers': 4, 'bidirectional': True, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0, 'batch_size': 2}, {'vocab_size': 29730, 'embed_size': 16, 'hidden_size': 32, 'num_layers': 4, 'bidirectional': True, 'dropout': 0, 'num_classes': 4, 'pad_idx': 0, 'batch_size': 1}, {'vocab_size': 29730, 'embed_size': 16, 'hidden_size': 32, 'num_layers': 4, 'bidirectional': True, 'dropout': 0, 'num_classes': 4, 'pad_idx': 0, 'batch_size': 2}, {'vocab_size': 29730, 
'embed_size': 16, 'hidden_size': 32, 'num_layers': 4, 'bidirectional': False, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0, 'batch_size': 1}, {'vocab_size': 29730, 'embed_size': 16, 'hidden_size': 32, 'num_layers': 4, 'bidirectional': False, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0, 'batch_size': 2}, {'vocab_size': 29730, 'embed_size': 16, 'hidden_size': 32, 'num_layers': 4, 'bidirectional': False, 'dropout': 0, 'num_classes': 4, 'pad_idx': 0, 'batch_size': 1}, {'vocab_size': 29730, 'embed_size': 16, 'hidden_size': 32, 'num_layers': 4, 'bidirectional': False, 'dropout': 0, 'num_classes': 4, 'pad_idx': 0, 'batch_size': 2}, {'vocab_size': 29730, 'embed_size': 16, 'hidden_size': 64, 'num_layers': 2, 'bidirectional': True, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0, 'batch_size': 1}, {'vocab_size': 29730, 'embed_size': 16, 'hidden_size': 64, 'num_layers': 2, 'bidirectional': True, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0, 'batch_size': 2}, {'vocab_size': 29730, 'embed_size': 16, 'hidden_size': 64, 'num_layers': 2, 'bidirectional': True, 'dropout': 0, 'num_classes': 4, 'pad_idx': 0, 'batch_size': 1}, {'vocab_size': 29730, 'embed_size': 16, 'hidden_size': 64, 'num_layers': 2, 'bidirectional': True, 'dropout': 0, 'num_classes': 4, 'pad_idx': 0, 'batch_size': 2}, {'vocab_size': 29730, 'embed_size': 16, 'hidden_size': 64, 'num_layers': 2, 'bidirectional': False, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0, 'batch_size': 1}, {'vocab_size': 29730, 'embed_size': 16, 'hidden_size': 64, 'num_layers': 2, 'bidirectional': False, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0, 'batch_size': 2}, {'vocab_size': 29730, 'embed_size': 16, 'hidden_size': 64, 'num_layers': 2, 'bidirectional': False, 'dropout': 0, 'num_classes': 4, 'pad_idx': 0, 'batch_size': 1}, {'vocab_size': 29730, 'embed_size': 16, 'hidden_size': 64, 'num_layers': 2, 'bidirectional': False, 'dropout': 0, 'num_classes': 4, 'pad_idx': 0, 'batch_size': 2}, {'vocab_size': 29730, 'embed_size': 16, 'hidden_size': 
64, 'num_layers': 4, 'bidirectional': True, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0, 'batch_size': 1}, {'vocab_size': 29730, 'embed_size': 16, 'hidden_size': 64, 'num_layers': 4, 'bidirectional': True, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0, 'batch_size': 2}, {'vocab_size': 29730, 'embed_size': 16, 'hidden_size': 64, 'num_layers': 4, 'bidirectional': True, 'dropout': 0, 'num_classes': 4, 'pad_idx': 0, 'batch_size': 1}, {'vocab_size': 29730, 'embed_size': 16, 'hidden_size': 64, 'num_layers': 4, 'bidirectional': True, 'dropout': 0, 'num_classes': 4, 'pad_idx': 0, 'batch_size': 2}, {'vocab_size': 29730, 'embed_size': 16, 'hidden_size': 64, 'num_layers': 4, 'bidirectional': False, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0, 'batch_size': 1}, {'vocab_size': 29730, 'embed_size': 16, 'hidden_size': 64, 'num_layers': 4, 'bidirectional': False, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0, 'batch_size': 2}, {'vocab_size': 29730, 'embed_size': 16, 'hidden_size': 64, 'num_layers': 4, 'bidirectional': False, 'dropout': 0, 'num_classes': 4, 'pad_idx': 0, 'batch_size': 1}, {'vocab_size': 29730, 'embed_size': 16, 'hidden_size': 64, 'num_layers': 4, 'bidirectional': False, 'dropout': 0, 'num_classes': 4, 'pad_idx': 0, 'batch_size': 2}]
    expected_outputs = [torch.Size([1, 2]), torch.Size([2, 2]), torch.Size([1, 4]), torch.Size([2, 4]), torch.Size([1, 2]), torch.Size([2, 2]), torch.Size([1, 4]), torch.Size([2, 4]), torch.Size([1, 2]), torch.Size([2, 2]), torch.Size([1, 4]), torch.Size([2, 4]), torch.Size([1, 2]), torch.Size([2, 2]), torch.Size([1, 4]), torch.Size([2, 4]), torch.Size([1, 2]), torch.Size([2, 2]), torch.Size([1, 4]), torch.Size([2, 4]), torch.Size([1, 2]), torch.Size([2, 2]), torch.Size([1, 4]), torch.Size([2, 4]), torch.Size([1, 2]), torch.Size([2, 2]), torch.Size([1, 4]), torch.Size([2, 4]), torch.Size([1, 2]), torch.Size([2, 2]), torch.Size([1, 4]), torch.Size([2, 4])]
    sanity_dataset = TextDataset(train_data, 'train', 5, 150)
    sanity_loader = torch.utils.data.DataLoader(sanity_dataset, batch_size=50, shuffle=True, num_workers=2, drop_last=True)

    sanityCheckModel(inputs, RNN, expected_outputs, "forward", sanity_loader)

--- TEST: Number of Model Parameters (tests __init__(...)) ---
	PASSED	Input: {'vocab_size': 1000, 'embed_size': 16, 'hidden_size': 32, 'num_layers': 2, 'bidirectional': True, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0}	Expected Num. Params: 44546	Your Num. Params: 44546
	PASSED	Input: {'vocab_size': 1000, 'embed_size': 16, 'hidden_size': 32, 'num_layers': 2, 'bidirectional': True, 'dropout': 0, 'num_classes': 4, 'pad_idx': 0}	Expected Num. Params: 44676	Your Num. Params: 44676
	PASSED	Input: {'vocab_size': 1000, 'embed_size': 16, 'hidden_size': 32, 'num_layers': 2, 'bidirectional': False, 'dropout': 0, 'num_classes': 2, 'pad_idx': 0}	Expected Num. Params: 27202	Your Num. Params: 27202
	PASSED	Input: {'vocab_size': 1000, 'embed_size': 16, 'hidden_size': 32, 'num_layers': 2, 'bidirectional': False, 'dropout': 0, 'num_classes': 4, 'pad_idx': 0}	Expected Num. Params: 27268	Your Num. Params: 27268
	PASSED	Input: {'vocab_size': 1000, 'embed_size': 16, 'hidden_size': 32, 'num_layers': 4, '
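
The expected counts can also be derived by hand from the layer shapes. The helper below is a hypothetical sketch (`gru_classifier_param_count` is not part of the notebook); it relies on the fact that each GRU direction in each layer stores input weights, hidden weights, and two bias vectors for its three gates.

```python
def gru_classifier_param_count(vocab_size, embed_size, hidden_size, num_layers,
                               bidirectional, num_classes):
    """Count parameters of embedding + GRU + linear classifier (hypothetical helper)."""
    num_dirs = 2 if bidirectional else 1
    total = vocab_size * embed_size  # embedding table
    for layer in range(num_layers):
        # Layers above the first receive the concatenated outputs of all directions.
        in_size = embed_size if layer == 0 else num_dirs * hidden_size
        # Per direction: 3 gates with input weights, hidden weights, and two bias vectors.
        per_dir = 3 * hidden_size * (in_size + hidden_size) + 2 * 3 * hidden_size
        total += num_dirs * per_dir
    # Linear layer on the concatenated final states, plus one bias per class.
    total += num_dirs * hidden_size * num_classes + num_classes
    return total


print(gru_classifier_param_count(1000, 16, 32, 2, True, 2))   # 44546
print(gru_classifier_param_count(1000, 16, 32, 2, False, 2))  # 27202
```

With `vocab_size=29730` (the value used in the sanity-check inputs), `embed_size=128`, `hidden_size=128`, two bidirectional layers, and two classes, the same formula gives 4,300,546, which agrees with the count printed for the full RNN below.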

## Train RNN Model
First, we initialize the train and test dataloaders.

In [None]:
if __name__=='__main__':
    THRESHOLD = 5 
    MAX_LEN = 200 
    BATCH_SIZE = 50 

    train_dataset = TextDataset(train_data, 'train', THRESHOLD, MAX_LEN)
    train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=2, drop_last=True)

    test_dataset = TextDataset(test_data, 'test', THRESHOLD, MAX_LEN, train_dataset.idx2word, train_dataset.word2idx)
    test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=1, shuffle=False, num_workers=1, drop_last=False)

In [None]:
if __name__=='__main__':
    rnn_model = RNN(vocab_size = train_dataset.vocab_size, # Don't change this
                embed_size = 128, 
                hidden_size = 128, 
                num_layers = 2,
                bidirectional = True,
                dropout = 0.5,
                num_classes = 2, # Don't change this
                pad_idx = train_dataset.word2idx[PAD]) # Don't change this

    # Put the model on device
    rnn_model = rnn_model.to(DEVICE)

    print('The model has {:,d} trainable parameters'.format(count_parameters(rnn_model)))

The model has 4,300,546 trainable parameters


In [None]:
if __name__=='__main__':    
    LEARNING_RATE = 5e-4 # Feel free to try other learning rates

    # Define your loss function
    criterion = nn.CrossEntropyLoss().to(DEVICE)

    # Define your optimizer
    optimizer = optim.Adam(rnn_model.parameters(), lr=LEARNING_RATE)

In [None]:
if __name__=='__main__':    
    N_EPOCHS = 6
    
    # train model for N_EPOCHS epochs
    train_model(rnn_model, N_EPOCHS, train_loader, optimizer, criterion)

Training Model...


  0%|          | 0/6 [00:00<?, ?it/s]

[TRAIN]	 Epoch:  1	 Loss: 0.6693	 Train Accuracy: 57.93%
[TRAIN]	 Epoch:  2	 Loss: 0.5614	 Train Accuracy: 71.08%
[TRAIN]	 Epoch:  3	 Loss: 0.4039	 Train Accuracy: 82.07%
[TRAIN]	 Epoch:  4	 Loss: 0.3160	 Train Accuracy: 86.72%
[TRAIN]	 Epoch:  5	 Loss: 0.2478	 Train Accuracy: 90.25%
[TRAIN]	 Epoch:  6	 Loss: 0.1861	 Train Accuracy: 92.69%
Model Trained!



## Evaluate RNN Model

Now we can evaluate the RNN. 

In [None]:
if __name__=='__main__':    
    evaluate(rnn_model, test_loader, criterion, use_tqdm=True) # Compute test data accuracy

Evaluating performance on the test dataset...

SOME PREDICTIONS FROM THE MODEL:


  0%|          | 0/5000 [00:00<?, ?it/s]

Input: when thinking of the revelation that the main character in " bubble " comes to at films end , i am reminded of last years " <UNK> " with christian <UNK> . the only difference between the two films is the literal physical weight of the characters.<br /><br />an understated , yet entirely realistic portrayal of small town life . the title is cause for contemplation . perhaps , we , the audience are the ones in the " bubble " as we are given no <UNK> in the films slim 90 minute running time . audience reactions were often smug and judgmental , clearly indicating how detached people can be from seeing any thread of humanity in characters so foreign to themselves . these characters are the ones people refer to as those that put george w . back in office for a second <UNK> /><br />it's <UNK> to consider how reality television has spoiled our sense of reality when watching an audience jump to their feet for the exit as soon as the credits role . this film has it's merits , and is deser