# Simple RNN Model

## Introduction
In this notebook, we use a recurrent neural network (RNN) to determine whether a text sequence of arbitrary length expresses positive or negative sentiment. An RNN takes a sequence of words, X = {x_1, ..., x_T}, as input. At each time step t, the model generates a hidden state, h_t, from the current input word and the previous hidden state.

                        h_t = RNN(x_t, h_{t-1})
                        
The final hidden state, h_T, is generated from the last word in the sequence and encodes historical information from all the previous steps. Once we have the final state, we feed it to a dense layer (also known as a fully connected layer), f, to predict the sentiment output, Y_hat = f(h_T).
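To make the recurrence concrete, here is a minimal pure-Python sketch of a single-layer recurrent step. It assumes an Elman-style update with a tanh nonlinearity, which the text above does not specify. The names `rnn_step`, `run_rnn`, `w_xh`, `w_hh` and `b_h` are hypothetical and chosen only for illustration.

```python
import math

def rnn_step(x_t, h_prev, w_xh, w_hh, b_h):
    """One assumed Elman-style step: h_t = tanh(W_xh x_t + W_hh h_prev + b_h)."""
    hidden_size = len(h_prev)
    h_t = []
    for j in range(hidden_size):
        total = b_h[j]
        # Contribution of the current input word vector x_t
        total += sum(w_xh[j][i] * x_t[i] for i in range(len(x_t)))
        # Contribution of the previous hidden state h_{t-1}
        total += sum(w_hh[j][k] * h_prev[k] for k in range(hidden_size))
        h_t.append(math.tanh(total))
    return h_t

def run_rnn(inputs, w_xh, w_hh, b_h):
    """Fold the step over the whole sequence and return the final state h_T."""
    h = [0.0] * len(b_h)  # h_0 is assumed to be all zeros
    for x_t in inputs:
        h = rnn_step(x_t, h, w_xh, w_hh, b_h)
    return h

# Toy weights and a three-step sequence of 2-dimensional "word vectors"
w_xh = [[0.5, -0.3], [0.1, 0.8]]
w_hh = [[0.2, 0.0], [0.0, 0.2]]
b_h = [0.0, 0.0]
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

h_T = run_rnn(sequence, w_xh, w_hh, b_h)
print(h_T)
```

The final state `h_T` has a fixed size regardless of sequence length, so a single dense layer can map it to a sentiment prediction.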

In [4]:
# Set the random seed for reproducibility
seed = 1234

torch.manual_seed(seed)
random.seed(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

In [6]:
raw_train_data, raw_test_data = torchtext.experimental.datasets.raw.IMDB()

aclImdb_v1.tar.gz: 100%|██████████| 84.1M/84.1M [07:19<00:00, 192kB/s] 


In [7]:
print(raw_train_data)

<torchtext.experimental.datasets.raw.text_classification.RawTextIterableDataset object at 0x7fb0a8b5fbe0>


In [13]:
raw_train_data = list(raw_train_data)
raw_test_data = list(raw_test_data)

In [14]:
print(f'Number of training examples: {len(raw_train_data):,}')
print(f'Number of testing examples: {len(raw_test_data):,}')

Number of training examples: 25,000
Number of testing examples: 25,000


In [12]:
print(raw_train_data[0])

(0, 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far between

In [17]:
def get_train_valid_split(raw_train_data, split_ratio = 0.7):
    raw_train_data = list(raw_train_data)
    random.shuffle(raw_train_data)
    
    n_train_examples = int(len(raw_train_data) * split_ratio)
    
    train_data = raw_train_data[:n_train_examples]
    valid_data = raw_train_data[n_train_examples:]
    
    return train_data, valid_data 

In [18]:
raw_train_data, raw_valid_data = get_train_valid_split(raw_train_data)

In [19]:
raw_train_data = list(raw_train_data)
raw_valid_data = list(raw_valid_data)

In [20]:
print(f'Number of training examples: {len(raw_train_data):,}')
print(f'Number of validation examples: {len(raw_valid_data):,}')
print(f'Number of testing examples: {len(raw_test_data):,}')

Number of training examples: 17,500
Number of validation examples: 7,500
Number of testing examples: 25,000
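The 17,500 / 7,500 split above follows from `int(25000 * 0.7)`. The sketch below reproduces those sizes on placeholder data. It uses a local `random.Random` instance so the shuffle does not depend on global seed state; the notebook itself uses the global `random.seed`. `split_examples` and the toy reviews are hypothetical.

```python
import random

def split_examples(examples, split_ratio=0.7, seed=1234):
    """Shuffle a copy of the examples and cut it at split_ratio."""
    rng = random.Random(seed)   # local generator, independent of random.seed()
    shuffled = list(examples)   # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * split_ratio)
    return shuffled[:n_train], shuffled[n_train:]

# Placeholder (label, text) pairs standing in for the 25,000 IMDB training reviews
toy = [(i % 2, f"review {i}") for i in range(25000)]
train, valid = split_examples(toy)
print(len(train), len(valid))  # 17500 7500
```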


In [21]:
class Tokenizer:
    def __init__(self, tokenize_fn='basic_english', lower=True, max_length=None):
        self.tokenize_func = torchtext.data.utils.get_tokenizer(tokenize_fn)
        self.lower = lower
        self.max_length = max_length
        
    def tokenize(self, s):
        tokens = self.tokenize_func(s)
        
        if self.lower:
            tokens = [token.lower() for token in tokens]
            
        if self.max_length is not None:
            tokens = tokens[:self.max_length]
            
        return tokens

In [22]:
max_length = 500

tokenizer = Tokenizer(max_length = max_length)

In [23]:
s = "hello world!"
print(tokenizer.tokenize(s))

['hello', 'world', '!']


In [24]:
print(s.split())

['hello', 'world!']
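The two outputs above differ because the tokenizer separates punctuation from words, while `str.split` only breaks on whitespace. The sketch below shows the idea with a regular expression. It is an illustrative stand-in, not the actual `basic_english` implementation, and `simple_tokenize` is a hypothetical name.

```python
import re

# A run of word characters, or any single non-space, non-word character
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def simple_tokenize(text, lower=True, max_length=None):
    """Split text into word and punctuation tokens, optionally truncating."""
    if lower:
        text = text.lower()
    tokens = TOKEN_PATTERN.findall(text)
    if max_length is not None:
        tokens = tokens[:max_length]
    return tokens

print(simple_tokenize("hello world!"))  # ['hello', 'world', '!']
print("hello world!".split())           # ['hello', 'world!']
```

Treating punctuation as separate tokens keeps `world` and `world!` from becoming two unrelated vocabulary entries.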


In [26]:
def build_vocab_from_data(raw_data, tokenizer, **vocab_kwargs):
    token_freq = collections.Counter()
    
    for label, text in raw_data:
        tokens = tokenizer.tokenize(text)
        token_freq.update(tokens)
    
    vocab = torchtext.vocab.Vocab(token_freq, **vocab_kwargs)
    
    return vocab

In [27]:
max_size = 25000

vocab = build_vocab_from_data(raw_train_data, tokenizer, max_size = max_size)

In [28]:
vocab.freqs.most_common(20)

[('the', 211890),
 ('.', 208427),
 (',', 173201),
 ('a', 103447),
 ('and', 103052),
 ('of', 91695),
 ('to', 84931),
 ("'", 83805),
 ('is', 67948),
 ('it', 61322),
 ('in', 58757),
 ('i', 57420),
 ('this', 49214),
 ('that', 46061),
 ('s', 38864),
 ('was', 31341),
 ('as', 29171),
 ('movie', 28776),
 ('for', 27903),
 ('with', 27680)]

In [30]:
vocab.itos[:10]

['<unk>', '<pad>', 'the', '.', ',', 'a', 'and', 'of', 'to', "'"]
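The `itos` list above starts with the two special tokens, followed by the regular tokens in descending frequency. A rough pure-Python sketch of that layout follows, under the assumption that specials are simply prepended and that `max_size` limits only the regular tokens. `build_vocab` and `lookup` are hypothetical helpers, not the torchtext API.

```python
from collections import Counter

def build_vocab(token_lists, max_size=None, specials=("<unk>", "<pad>")):
    """Return (itos, stoi): specials first, then tokens by descending frequency."""
    freqs = Counter()
    for tokens in token_lists:
        freqs.update(tokens)
    # most_common keeps first-seen order among tokens with equal counts
    itos = list(specials) + [tok for tok, _ in freqs.most_common(max_size)]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

def lookup(stoi, token, unk_token="<unk>"):
    """Map a token to its index, falling back to the <unk> index."""
    return stoi.get(token, stoi[unk_token])

docs = [["the", "movie", "was", "good"],
        ["the", "movie", "was", "bad"],
        ["the", "end"]]
itos, stoi = build_vocab(docs, max_size=3)

print(itos)                  # ['<unk>', '<pad>', 'the', 'movie', 'was']
print(lookup(stoi, "good"))  # 0, because 'good' fell outside max_size
```

Under this layout the vocabulary length is `max_size` plus the number of special tokens, which is consistent with 25,000 + 2 = 25,002 entries.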

In [31]:
def process_raw_data(raw_data, tokenizer, vocab):
    #raw_data = [(label, text) for (label, text) in raw_data]
    
    text_transform = sequential_transforms(tokenizer.tokenize, 
                                           vocab_func(vocab),
                                           totensor(dtype=torch.long))
    
    label_transform = sequential_transforms(totensor(dtype=torch.long))
    
    transforms = (label_transform, text_transform)
    
    dataset = TextClassificationDataset(raw_data,
                                        vocab,
                                        transforms)
    
    return dataset

In [32]:
train_data = process_raw_data(raw_train_data, tokenizer, vocab)

In [33]:
len(train_data)

17500

In [34]:
train_data

<torchtext.experimental.datasets.text_classification.TextClassificationDataset at 0x7fb0a8a854c0>

In [36]:
label, indexes = train_data[0]

print(indexes)

tensor([  610,  8612,    10,     5,    60,   221,     6,     5,    60,     0,
          172,     3,    32,    91,    19,     9,   330,     7,  1487,     9,
           17,     2,    35,    13,   414,   128,     4,     6,    11,   293,
            5,   210,     7,   116,    21,   111,   632,  1851,     7,   125,
         1427,    18,    39,    84,    90,  2552,  1114,     4,     6,    44,
          131,     2,  1851,     7,     2,   490,     3,    44,    10,     9,
         6126,     9,    72,    43,     2,   428,   496,     8,   293,     2,
          348,     7,     2,   490,     8,     2,   584,   831,   481,     3,
            2,  4673,   111,    30,   119,  1783,     4,    35,  3138,     6,
           35,  2349,    42,   854,    21,     5, 17557,  5565,  1586,     3,
            2,   560,     7,    41,   174,  3138,  2038,  1996,    42,  1431,
            8,    34,    41,  2004,  2289,     7,     2,  6375,  5693,   346,
            0,     2,  3138,  2140,    12,     2,   348,     6, 

In [38]:
print([vocab.itos[i] for i in indexes])

['david', 'mamet', 'is', 'a', 'very', 'interesting', 'and', 'a', 'very', '<unk>', 'director', '.', 'his', 'first', 'movie', "'", 'house', 'of', 'games', "'", 'was', 'the', 'one', 'i', 'liked', 'best', ',', 'and', 'it', 'set', 'a', 'series', 'of', 'films', 'with', 'characters', 'whose', 'perspective', 'of', 'life', 'changes', 'as', 'they', 'get', 'into', 'complicated', 'situations', ',', 'and', 'so', 'does', 'the', 'perspective', 'of', 'the', 'viewer', '.', 'so', 'is', "'", 'homicide', "'", 'which', 'from', 'the', 'title', 'tries', 'to', 'set', 'the', 'mind', 'of', 'the', 'viewer', 'to', 'the', 'usual', 'crime', 'drama', '.', 'the', 'principal', 'characters', 'are', 'two', 'cops', ',', 'one', 'jewish', 'and', 'one', 'irish', 'who', 'deal', 'with', 'a', 'racially', 'charged', 'area', '.', 'the', 'murder', 'of', 'an', 'old', 'jewish', 'shop', 'owner', 'who', 'proves', 'to', 'be', 'an', 'ancient', 'veteran', 'of', 'the', 'israeli', 'independence', 'war', '<unk>', 'the', 'jewish', 'identity

In [39]:
valid_data = process_raw_data(raw_valid_data, tokenizer, vocab)
test_data = process_raw_data(raw_test_data, tokenizer, vocab)

In [40]:
print(vocab['<unk>'])

0


In [43]:
class Collator:
    def __init__(self, pad_idx):
        self.pad_idx = pad_idx
        
    def collate(self, batch):
        labels, text = zip(*batch)
        
        labels = torch.LongTensor(labels)
        
        text = nn.utils.rnn.pad_sequence(text, padding_value = self.pad_idx)
        
        return labels, text

In [44]:
pad_token = '<pad>'
pad_idx = vocab[pad_token]

collator = Collator(pad_idx)
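The collator pads every sequence in a batch to the length of the longest one. With the default `batch_first=False`, `pad_sequence` returns a tensor of shape `[seq_len, batch_size]`. The sketch below mimics that behaviour with plain lists to show the resulting layout; `pad_batch` is a hypothetical helper.

```python
def pad_batch(sequences, pad_idx):
    """Right-pad index lists to equal length and return them time-major."""
    max_len = max(len(seq) for seq in sequences)
    padded = [list(seq) + [pad_idx] * (max_len - len(seq)) for seq in sequences]
    # Transpose [batch_size, seq_len] -> [seq_len, batch_size]
    return [list(step) for step in zip(*padded)]

batch = [[5, 6, 7], [8]]
print(pad_batch(batch, pad_idx=1))  # [[5, 8], [6, 1], [7, 1]]
```

Each inner list is one time step across the batch, so the shorter second sequence is filled with the padding index 1 after its only token.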

In [45]:
batch_size = 256

train_iterator = torch.utils.data.DataLoader(train_data,
                                             batch_size,
                                             shuffle = True,
                                             collate_fn = collator.collate)

valid_iterator = torch.utils.data.DataLoader(valid_data,
                                             batch_size,
                                             shuffle = False,
                                             collate_fn = collator.collate)

test_iterator = torch.utils.data.DataLoader(test_data,
                                             batch_size,
                                             shuffle = False,
                                             collate_fn = collator.collate)

In [47]:
class NBOW(nn.Module):
    def __init__(self, input_dim, emb_dim, output_dim, pad_idx):
        super(NBOW, self).__init__()
        
        self.embedding = nn.Embedding(input_dim, emb_dim, padding_idx = pad_idx)
        self.fc = nn.Linear(emb_dim, output_dim)
        
    def forward(self, text):
        # text = [seq_len, batch_size]
        embedded = self.embedding(text)
        
        # embedded = [seq_len, batch_size, emb_dim]
        
        pooled = embedded.mean(0)
        
        # pooled = [batch_size, emb_dim]
        
        prediction = self.fc(pooled)
        
        # prediction = [batch_size, output_dim]
        
        return prediction
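The pooling step `embedded.mean(0)` averages over the sequence dimension. The list-based sketch below performs the same reduction on a tiny hand-written example; `mean_pool` is a hypothetical name. Padded positions are represented by zero vectors, on the assumption that the `padding_idx` row of the embedding is zero. They still count toward the divisor, so the average is taken over the padded length.

```python
def mean_pool(embedded):
    """Average [seq_len][batch_size][emb_dim] over seq_len -> [batch_size][emb_dim]."""
    seq_len = len(embedded)
    batch_size = len(embedded[0])
    emb_dim = len(embedded[0][0])
    pooled = []
    for b in range(batch_size):
        pooled.append([
            sum(embedded[t][b][d] for t in range(seq_len)) / seq_len
            for d in range(emb_dim)
        ])
    return pooled

# seq_len = 2, batch_size = 2, emb_dim = 2
# The second example has a padded (all-zero) vector at the second time step
embedded = [
    [[1.0, 2.0], [4.0, 0.0]],
    [[3.0, 4.0], [0.0, 0.0]],
]
pooled = mean_pool(embedded)
print(pooled)  # [[2.0, 3.0], [2.0, 0.0]]
```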

In [48]:
input_dim = len(vocab)

In [49]:
input_dim

25002

In [50]:
emb_dim = 100
output_dim = 2

model = NBOW(input_dim, emb_dim, output_dim, pad_idx)

In [51]:
model.parameters

<bound method Module.parameters of NBOW(
  (embedding): Embedding(25002, 100, padding_idx=1)
  (fc): Linear(in_features=100, out_features=2, bias=True)
)>

In [54]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

In [55]:
print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 2,500,402 trainable parameters


In [57]:
[p.numel() for p in model.parameters() if p.requires_grad]

[2500200, 200, 2]
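The three numbers above can be checked by hand: the embedding table holds `input_dim * emb_dim` weights, and the linear layer holds `emb_dim * output_dim` weights plus `output_dim` biases. A short sketch of that arithmetic, with `nbow_parameter_counts` as a hypothetical helper:

```python
def nbow_parameter_counts(input_dim, emb_dim, output_dim):
    """Parameter counts for an embedding table followed by one linear layer."""
    embedding = input_dim * emb_dim   # one emb_dim vector per vocabulary entry
    fc_weight = emb_dim * output_dim  # linear layer weight matrix
    fc_bias = output_dim              # linear layer bias vector
    return [embedding, fc_weight, fc_bias]

counts = nbow_parameter_counts(25002, 100, 2)
print(counts)       # [2500200, 200, 2]
print(sum(counts))  # 2500402
```

Almost all of the parameters sit in the embedding table, so the vocabulary size dominates the model size.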

In [65]:
glove = torchtext.experimental.vectors.GloVe(name = '6B', dim = '100')

RuntimeError: The hash of .data/glove.6B.zip does not match. Delete the file manually and retry.

In [2]:
import torch
import torch.nn as nn
import torch.optim as optim

import torchtext
import torchtext.experimental
import torchtext.experimental.vectors
from torchtext.experimental.datasets.raw.text_classification import RawTextIterableDataset
from torchtext.experimental.datasets.text_classification import TextClassificationDataset
from torchtext.experimental.functional import sequential_transforms, vocab_func, totensor

import collections
import random
import time