STAT 453: Deep Learning (Spring 2021)  
Instructor: Sebastian Raschka (sraschka@wisc.edu)  

Course website: http://pages.stat.wisc.edu/~sraschka/teaching/stat453-ss2021/  
GitHub repository: https://github.com/rasbt/stat453-deep-learning-ss21

---

# RNN Classifier with LSTM Trained on Own Dataset (IMDB)

Example notebook showing how to use your own CSV text dataset to train a simple RNN for sentiment classification (here: a binary classification problem with two labels, positive and negative) using LSTM (Long Short-Term Memory) cells.

Since the [official migration guide](https://github.com/pytorch/text/blob/master/examples/legacy_tutorial/migration_tutorial.ipynb) is outdated, I used a combination of sources to migrate from the old torchtext API (pre 0.9.0) to the 0.14 API:

- The official migration guide. Although it is outdated and the legacy code has been removed, some parts are still useful, primarily the method of building the vocabulary, i.e., using a `Counter` and iterating over the sentences. There is a new [`build_vocab_from_iterator`](https://pytorch.org/text/stable/vocab.html#build-vocab-from-iterator) function that does the same thing, but it doesn't allow easy inspection of, for example, the 10 most common words.

- The [torchtext documentation](https://pytorch.org/text/stable/index.html)

- The [torchdata documentation](https://pytorch.org/data/0.5/) (looking through the classes/methods in an attempt to replace the old code). Torchtext moved most of its functionality into this new package. In this notebook, torchtext is only used for tokenization and for building the vocabulary.

- [This guide](https://medium.com/@bitdribble/migrate-torchtext-to-the-new-0-9-0-api-1ff1472b5d71) has some nice parts; however, it is missing some steps, such as splitting the data into training and test sets.

- Torchtext GitHub issues such as https://github.com/pytorch/text/issues/711#issuecomment-1154000107 and https://github.com/pytorch/text/issues/1349

In [1]:
# %load_ext watermark
# %watermark -a 'Sebastian Raschka' -v -p torch,torchtext

import torch
import torch.nn.functional as F
import torchtext # Not available on conda-forge for Windows; install it using pip
import torchdata # needed as of torchtext 0.12
import time
import random
import pandas as pd
import os

torch.backends.cudnn.deterministic = True



## General Settings

In [2]:
RANDOM_SEED = 123
torch.manual_seed(RANDOM_SEED)

VOCABULARY_SIZE = 20000
LEARNING_RATE = 0.005
BATCH_SIZE = 128
NUM_EPOCHS = 15
DEVICE = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

EMBEDDING_DIM = 128
HIDDEN_DIM = 256
NUM_CLASSES = 2

DATA_DIRECTORY = 'data/Raschka/Lecture15'

## Download Dataset

The following cells will download the IMDB movie review dataset (http://ai.stanford.edu/~amaas/data/sentiment/) for positive-negative sentiment classification as a CSV-formatted file.

On Windows, where `wget` may not be available, visit the link below, download the file, and extract it manually:

In [4]:
!wget https://github.com/rasbt/python-machine-learning-book-3rd-edition/raw/master/ch08/movie_data.csv.gz

'wget' is not recognized as an internal or external command,
operable program or batch file.


In [4]:
!gunzip -f movie_data.csv.gz 

Check that the dataset looks okay:

In [4]:
df = pd.read_csv(os.path.join(DATA_DIRECTORY, 'movie_data.csv'))
df.tail()

Unnamed: 0,TEXT_COLUMN_NAME,LABEL_COLUMN_NAME
49995,"OK, lets start with the best. the building. al...",0
49996,The British 'heritage film' industry is out of...,0
49997,I don't even know where to begin on this one. ...,0
49998,Richard Tyler is a little boy who is scared of...,0
49999,I waited long to watch this movie. Also becaus...,1


In [6]:
df.columns = ['TEXT_COLUMN_NAME', 'LABEL_COLUMN_NAME']
df.to_csv(os.path.join(DATA_DIRECTORY, 'movie_data.csv'), index=None)

df = pd.read_csv(os.path.join(DATA_DIRECTORY, 'movie_data.csv'))
df.head()

Unnamed: 0,TEXT_COLUMN_NAME,LABEL_COLUMN_NAME
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0
3,hi for all the people who have seen this wonde...,1
4,"I recently bought the DVD, forgetting just how...",0


In [7]:
del df

## Prepare Dataset with Torchtext

In [None]:
!pip install spacy

Download the English spaCy model via:
    
- `python -m spacy download en_core_web_sm`

Load data and tokenize text

In [5]:
data = pd.read_csv(os.path.join(DATA_DIRECTORY, 'movie_data.csv'))

dp = torchdata.datapipes.map.SequenceWrapper(data.values)

In [6]:
# print first four rows of dp
list(dp)[:4]

[array(['In 1974, the teenager Martha Moxley (Maggie Grace) moves to the high-class area of Belle Haven, Greenwich, Connecticut. On the Mischief Night, eve of Halloween, she was murdered in the backyard of her house and her murder remained unsolved. Twenty-two years later, the writer Mark Fuhrman (Christopher Meloni), who is a former LA detective that has fallen in disgrace for perjury in O.J. Simpson trial and moved to Idaho, decides to investigate the case with his partner Stephen Weeks (Andrew Mitchell) with the purpose of writing a book. The locals squirm and do not welcome them, but with the support of the retired detective Steve Carroll (Robert Forster) that was in charge of the investigation in the 70\'s, they discover the criminal and a net of power and money to cover the murder.<br /><br />"Murder in Greenwich" is a good TV movie, with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenager whose mother was a Kennedy. The powerful and ri

In [None]:
!python -m spacy download en_core_web_sm

In [7]:
tokenizer = torchtext.data.utils.get_tokenizer('spacy', language='en_core_web_sm')

def tokenize_dp(dataset):
    text_dp, label_dp = dataset.unzip(2)

    text_dp = text_dp.map(tokenizer) # split sentences into words
    label_dp = label_dp.map(str) # convert labels to strings

    return text_dp.zip(label_dp)

tokenized_dp = tokenize_dp(dp)

In [8]:
# print first four rows of tokenized_dp
list(tokenized_dp)[:4]

[(['In',
   '1974',
   ',',
   'the',
   'teenager',
   'Martha',
   'Moxley',
   '(',
   'Maggie',
   'Grace',
   ')',
   'moves',
   'to',
   'the',
   'high',
   '-',
   'class',
   'area',
   'of',
   'Belle',
   'Haven',
   ',',
   'Greenwich',
   ',',
   'Connecticut',
   '.',
   'On',
   'the',
   'Mischief',
   'Night',
   ',',
   'eve',
   'of',
   'Halloween',
   ',',
   'she',
   'was',
   'murdered',
   'in',
   'the',
   'backyard',
   'of',
   'her',
   'house',
   'and',
   'her',
   'murder',
   'remained',
   'unsolved',
   '.',
   'Twenty',
   '-',
   'two',
   'years',
   'later',
   ',',
   'the',
   'writer',
   'Mark',
   'Fuhrman',
   '(',
   'Christopher',
   'Meloni',
   ')',
   ',',
   'who',
   'is',
   'a',
   'former',
   'LA',
   'detective',
   'that',
   'has',
   'fallen',
   'in',
   'disgrace',
   'for',
   'perjury',
   'in',
   'O.J.',
   'Simpson',
   'trial',
   'and',
   'moved',
   'to',
   'Idaho',
   ',',
   'decides',
   'to',
   'investigate

## Split Dataset into Train/Validation/Test

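Split the dataset into training, validation, and test partitions.
- training: used to fit the model parameters
- validation: used to evaluate the model during training, e.g., for tuning hyperparameters or deciding when to stop; because these choices adapt to the validation set, it cannot give an unbiased final estimate
- test: held out until training is finished, then used to estimate performance on unseen data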

We use the `torch.utils.data.random_split` function instead of [`torchdata.datapipes.iter.RandomSplitter`](https://pytorch.org/data/main/generated/torchdata.datapipes.iter.RandomSplitter.html#torchdata.datapipes.iter.RandomSplitter) because `RandomSplitter` does not take advantage of the fact that `tokenized_dp` is indexable (i.e., any entry can be accessed at random). `RandomSplitter` would also work, but it triggers many warnings.
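The two-stage split used in this section first holds out 20% of the data for testing and then sets aside 15% of the remainder for validation. A minimal plain-Python sketch of the resulting partition sizes, assuming the 50,000-row dataset used here; the helper `two_way_split` is hypothetical and does not reproduce the exact shuffling of `torch.utils.data.random_split`:

```python
import random

def two_way_split(indices, first_fraction, seed):
    """Shuffle indices and split them into two lists (hypothetical helper)."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * first_fraction)
    return shuffled[:cut], shuffled[cut:]

all_indices = range(50_000)

# Stage 1: 80% train+validation, 20% test
train_and_valid, test = two_way_split(all_indices, 0.8, seed=123)

# Stage 2: 85% of the remainder for training, 15% for validation
train, valid = two_way_split(train_and_valid, 0.85, seed=123)

print(len(train), len(valid), len(test))  # 34000 6000 10000
```

Overall this corresponds to a 68% / 12% / 20% split of the full dataset.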

In [9]:
train_data_1, test_data = torch.utils.data.random_split(
    tokenized_dp,
    [0.8, 0.2],
    generator = torch.Generator().manual_seed(RANDOM_SEED)
)

print(f'Num Train: {len(train_data_1)}')
print(f'Num Test: {len(test_data)}')

Num Train: 40000
Num Test: 10000


In [10]:
train_data, valid_data = torch.utils.data.random_split(
    train_data_1,
    [0.85, 0.15],
    generator = torch.Generator().manual_seed(RANDOM_SEED)
)

print(f'Num Train: {len(train_data)}')
print(f'Num Validation: {len(valid_data)}')

Num Train: 34000
Num Validation: 6000


In [11]:
for example in train_data:
    print(example)
    break

(['The', 'GREAT', 'NEWS', 'is', 'that', 'this', 'film', 'is', 'now', 'AVAILABLE', 'on', 'DVD', 'from', 'http://treasureflix.com', 'for', 'all', 'those', 'who', 'wish', 'to', 'own', 'it', 'as', 'well', 'as', 'on', 'video', '.', 'This', 'is', 'good', 'news', 'as', 'it', 'is', 'one', 'of', 'my', 'favourite', 'films!<br', '/><br', '/>I', 'watched', 'this', 'film', 'for', 'the', 'first', 'time', 'in', 'the', '80s', 'and', 'it', 'is', 'compulsory', 'holiday', 'viewing', '.', 'Living', 'in', 'the', 'small', 'market', 'town', 'called', 'Tewkesbury', ',', 'picturesque', 'and', 'with', 'its', 'own', 'traditions', ',', 'of', 'reenactments', ',', 'and', 'traditions', 'we', 'are', 'also', 'a', 'cosy', 'tight', 'community', '.', 'We', 'are', 'now', 'also', 'faced', 'with', 'large', 'housing', 'developments', 'which', 'threaten', 'to', 'destroy', 'the', 'Community', 'and', 'you', 'can', 'see', 'why', 'I', 'love', 'this', 'film', 'First', 'of', 'all', '-', 'and', 'most', 'important', ',', 'there', 'ar

## Build Vocabulary

Build the vocabulary in three steps:

1. Count the occurrences of each word and label
2. Construct a vocab object from the most common words plus the special tokens `<unk>` and `<pad>`
3. Set `<unk>` as the fallback for out-of-vocabulary words
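In plain Python, the three steps above amount to roughly the following. This is only an illustrative sketch without torchtext; `build_vocab`, `lookup`, and the toy corpus are hypothetical names and data, not part of the library:

```python
from collections import Counter

def build_vocab(token_lists, max_size, specials=("<unk>", "<pad>")):
    """Return (itos, stoi): specials first, then the most frequent tokens."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    itos = list(specials) + [tok for tok, _ in counts.most_common(max_size)]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

def lookup(stoi, tokens, unk_token="<unk>"):
    """Map tokens to indices, falling back to the <unk> index."""
    unk_index = stoi[unk_token]
    return [stoi.get(tok, unk_index) for tok in tokens]

toy_corpus = [["the", "movie", "was", "great"], ["the", "movie", "was", "bad"]]
itos, stoi = build_vocab(toy_corpus, max_size=3)

print(len(itos))  # max_size + 2 special tokens
print(lookup(stoi, ["the", "plot", "was", "great"]))  # unseen or rare words map to index 0
```

Placing the special tokens first means `<unk>` and `<pad>` always receive indices 0 and 1, regardless of the corpus.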

In [12]:
from collections import Counter, OrderedDict

text_counter = Counter()
label_counter = Counter()

for line, label in train_data:
    text_counter.update(line)
    label_counter.update([label])

In [13]:
text_vocab = torchtext.vocab.vocab(
    OrderedDict(text_counter.most_common(VOCABULARY_SIZE)), # choose most common
    specials=('<unk>', '<pad>'), # extra words added to the vocab
)
text_vocab.set_default_index(text_vocab['<unk>']) # we need to manually set that <unk> is the default

label_vocab = torchtext.vocab.vocab(OrderedDict(label_counter)) # label_vocab does not have specials

In [14]:
print(f'Vocabulary size: {len(text_vocab)}')
print(f'Number of classes: {len(label_vocab)}')

Vocabulary size: 20002
Number of classes: 2


- The vocabulary has 20,002 entries rather than 20,000 because of the `<unk>` and `<pad>` tokens
- PyTorch RNNs can deal with arbitrary lengths due to dynamic graphs, but padding is necessary to bring the sequences in a given minibatch to the same length so we can store them in a single array
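To make the padding point concrete, here is a small plain-Python sketch of what padding a minibatch means. The helper `pad_batch` is hypothetical; it only imitates the time-major layout (`[sentence length, batch size]`) that `torch.nn.utils.rnn.pad_sequence` returns by default:

```python
def pad_batch(sequences, pad_index):
    """Pad lists of token indices to a common length.

    Returns a nested list laid out as [max_len][batch_size].
    """
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_index] * (max_len - len(seq)) for seq in sequences]
    # Transpose from [batch_size][max_len] to [max_len][batch_size]
    return [list(step) for step in zip(*padded)]

# Three "documents" of different lengths; index 1 plays the role of <pad>
batch = [[2, 9, 4, 8], [2, 7], [5]]
matrix = pad_batch(batch, pad_index=1)

print(len(matrix), len(matrix[0]))  # 4 3
for row in matrix:
    print(row)
```

The number of rows equals the length of the longest sequence in the batch, which is why batch shapes differ from one minibatch to the next.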

**Look at most common words:**

In [15]:
print(text_counter.most_common(20))

[('the', 390911), (',', 368217), ('.', 318439), ('and', 210500), ('a', 210490), ('of', 194463), ('to', 179740), ('is', 146136), ('in', 118736), ('I', 105440), ('it', 103564), ('that', 94370), ('"', 85796), ("'s", 83204), ('this', 81393), ('-', 71103), ('/><br', 68674), ('was', 67783), ('movie', 57572), ('as', 57538)]


**Tokens corresponding to the first 10 indices (0, 1, ..., 9):**

In [16]:
print(text_vocab.get_itos()[:10]) # itos = integer-to-string

['<unk>', '<pad>', 'the', ',', '.', 'and', 'a', 'of', 'to', 'is']


**Converting a string to an integer:**

In [17]:
print(text_vocab['the']) # stoi = string-to-integer

2


**Class labels:**

In [18]:
print(label_vocab.get_stoi())

{'1': 0, '0': 1}


**Class label count:**

In [19]:
label_counter

Counter({'1': 17029, '0': 16971})

## Define Data Loaders

Before we define the data loaders, let us get our datasets into their proper form, that is, an iterable of (list of word_index, label_index) pairs. We also call `in_memory_cache()` to cache the results.

In [20]:
def convert_to_indexed(dataset):
    dataset = torchdata.datapipes.map.SequenceWrapper(dataset)
    text_dataset, label_dataset = dataset.unzip(2)

    text_dataset = text_dataset.map(text_vocab).map(torch.tensor)
    label_dataset = label_dataset.map(label_vocab.__getitem__).map(torch.tensor)

    dataset = text_dataset.zip(label_dataset)

    return dataset.in_memory_cache()

train_idata = convert_to_indexed(train_data)
valid_idata = convert_to_indexed(valid_data)
test_idata = convert_to_indexed(test_data)

Collate into batches. The `collate_batch` function takes a list of (list of text_index, label_index) pairs and

1. Splits it into two lists
2. Converts the label_list to a single vector
3. Converts text_list into a 2d array (matrix) with necessary padding

In [216]:
# code which attempts to produce buckets with sentences of similar length
# def sort_bucket(bucket):
#     return sorted(bucket, key=lambda x: len(x[0]))
# def batch_data(dataset):
#     dataset = torchdata.datapipes.iter.BucketBatcher(
#         torchdata.datapipes.iter.IterableWrapper(dataset),
#         batch_size=BATCH_SIZE,
#         sort_key=sort_bucket
#     )
#     return dataset

# def collate_batch(batch):
#     text_list = [text for text, label in batch]
#     label_list = [label for text, label in batch]

#     padding_value = text_vocab.get_stoi()['<pad>']

#     label_list = torch.tensor(label_list)
#     text_list = torch.nn.utils.rnn.pad_sequence(text_list, padding_value=padding_value)
#     return text_list, label_list

# train_loader = batch_data(train_idata).map(collate_batch)
# valid_loader = batch_data(valid_idata).map(collate_batch)
# test_loader = batch_data(test_idata).map(collate_batch)
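The idea behind the commented-out bucketing code can be sketched in plain Python: shuffle the examples, sort each pool of examples by length, and cut the pool into batches so that every batch needs little padding. This is an illustrative sketch only, not how `BucketBatcher` is implemented; `bucket_batches` and its `pool_factor` parameter are hypothetical:

```python
import random

def bucket_batches(examples, batch_size, pool_factor=100, seed=0):
    """Yield batches whose sequences have similar lengths.

    examples: list of (token_indices, label) pairs.
    pool_factor: each pool holds batch_size * pool_factor examples,
    which are sorted by length before being cut into batches.
    """
    rng = random.Random(seed)
    order = list(examples)
    rng.shuffle(order)
    pool_size = batch_size * pool_factor
    for start in range(0, len(order), pool_size):
        pool = sorted(order[start:start + pool_size], key=lambda ex: len(ex[0]))
        batches = [pool[i:i + batch_size] for i in range(0, len(pool), batch_size)]
        rng.shuffle(batches)  # randomize the order in which batches are served
        yield from batches

# Toy data: six "documents" with very different lengths, all labeled 0
toy = [([0] * length, 0) for length in (5, 50, 6, 48, 7, 52)]
for batch in bucket_batches(toy, batch_size=3, pool_factor=2):
    print([len(tokens) for tokens, _ in batch])
```

With the toy data, short and long documents end up in separate batches, so the short batch is padded to length 7 instead of 52.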

In [21]:

def collate_batch(batch):
    text_list = [text for text, label in batch]
    label_list = [label for text, label in batch]

    padding_value = text_vocab.get_stoi()['<pad>']

    label_list = torch.tensor(label_list)
    text_list = torch.nn.utils.rnn.pad_sequence(text_list, padding_value=padding_value)
    return text_list, label_list



train_loader = torch.utils.data.DataLoader(
                    train_idata,
                    batch_size=BATCH_SIZE,
                    collate_fn=collate_batch)

valid_loader = torch.utils.data.DataLoader(
                    valid_idata, 
                    batch_size=BATCH_SIZE,
                    collate_fn=collate_batch)

test_loader = torch.utils.data.DataLoader(
                    test_idata, 
                    batch_size=BATCH_SIZE,
                    collate_fn=collate_batch)

Testing the iterators (note that the number of rows depends on the longest document in the respective batch):

In [22]:
print('Train')
for batch in train_loader:
    print(f'Text matrix size: {batch[0].size()}')
    print(f'Target vector size: {batch[1].size()}')
    break
    
print('\nValid:')
for batch in valid_loader:
    print(f'Text matrix size: {batch[0].size()}')
    print(f'Target vector size: {batch[1].size()}')
    break
    
print('\nTest:')
for batch in test_loader:
    print(f'Text matrix size: {batch[0].size()}')
    print(f'Target vector size: {batch[1].size()}')
    break

Train
Text matrix size: torch.Size([1052, 128])
Target vector size: torch.Size([128])

Valid:
Text matrix size: torch.Size([1193, 128])
Target vector size: torch.Size([128])

Test:
Text matrix size: torch.Size([1120, 128])
Target vector size: torch.Size([128])


## Model

In [23]:
class RNN(torch.nn.Module):
    
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim):
        super().__init__()

        self.embedding = torch.nn.Embedding(input_dim, embedding_dim)
        #self.rnn = torch.nn.RNN(embedding_dim,
        #                        hidden_dim,
        #                        nonlinearity='relu')
        self.rnn = torch.nn.LSTM(embedding_dim,
                                 hidden_dim)        
        
        self.fc = torch.nn.Linear(hidden_dim, output_dim)
        

    def forward(self, text):
        # text dim: [sentence length, batch size]
        
        embedded = self.embedding(text)
        # embedded dim: [sentence length, batch size, embedding dim]
        
        output, (hidden, cell) = self.rnn(embedded)
        # output dim: [sentence length, batch size, hidden dim]
        # hidden dim: [1, batch size, hidden dim]

        hidden.squeeze_(0)
        # hidden dim: [batch size, hidden dim]
        
        output = self.fc(hidden)
        return output

In [24]:
torch.manual_seed(RANDOM_SEED)
model = RNN(input_dim=len(text_vocab),
            embedding_dim=EMBEDDING_DIM,
            hidden_dim=HIDDEN_DIM,
            output_dim=NUM_CLASSES # could use 1 for binary classification
)

model = model.to(DEVICE)
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

## Training

In [25]:
def compute_accuracy(model, data_loader, device):

    with torch.no_grad():

        correct_pred, num_examples = 0, 0

        for i, (features, targets) in enumerate(data_loader):

            features = features.to(device)
            targets = targets.to(device)  # keep integer class labels for comparison

            logits = model(features)
            _, predicted_labels = torch.max(logits, 1)

            num_examples += targets.size(0)
            correct_pred += (predicted_labels == targets).sum()
    return correct_pred.float()/num_examples * 100

In [26]:
start_time = time.time()

for epoch in range(NUM_EPOCHS):
# for epoch in range(2):
    model.train()
    for batch_idx, batch_data in enumerate(train_loader):
        
        text = batch_data[0].to(DEVICE)
        labels = batch_data[1].to(DEVICE)

        ### FORWARD AND BACK PROP
        logits = model(text)
        loss = F.cross_entropy(logits, labels)
        optimizer.zero_grad()
        
        loss.backward()
        
        ### UPDATE MODEL PARAMETERS
        optimizer.step()
        
        ### LOGGING
        if not batch_idx % 50:
            print (f'Epoch: {epoch+1:03d}/{NUM_EPOCHS:03d} | '
                   f'Batch {batch_idx:03d}/{len(train_loader):03d} | '
                   f'Loss: {loss:.4f}')

    with torch.set_grad_enabled(False):
        print(f'training accuracy: '
              f'{compute_accuracy(model, train_loader, DEVICE):.2f}%'
              f'\nvalid accuracy: '
              f'{compute_accuracy(model, valid_loader, DEVICE):.2f}%'
            )
        
    print(f'Time elapsed: {(time.time() - start_time)/60:.2f} min')
    
print(f'Total Training Time: {(time.time() - start_time)/60:.2f} min')
print(f'Test accuracy: {compute_accuracy(model, test_loader, DEVICE):.2f}%')

Epoch: 001/015 | Batch 000/266 | Loss: 0.6902
Epoch: 001/015 | Batch 050/266 | Loss: 0.6930
Epoch: 001/015 | Batch 100/266 | Loss: 0.6940
Epoch: 001/015 | Batch 150/266 | Loss: 0.6911
Epoch: 001/015 | Batch 200/266 | Loss: 0.7045
Epoch: 001/015 | Batch 250/266 | Loss: 0.6944
training accuracy: 50.09%
valid accuracy: 50.18%
Time elapsed: 0.94 min
Epoch: 002/015 | Batch 000/266 | Loss: 0.6929
Epoch: 002/015 | Batch 050/266 | Loss: 0.6904
Epoch: 002/015 | Batch 100/266 | Loss: 0.6928
Epoch: 002/015 | Batch 150/266 | Loss: 0.6883
Epoch: 002/015 | Batch 200/266 | Loss: 0.7093
Epoch: 002/015 | Batch 250/266 | Loss: 0.6935
training accuracy: 50.23%
valid accuracy: 50.15%
Time elapsed: 1.43 min
Epoch: 003/015 | Batch 000/266 | Loss: 0.6928
Epoch: 003/015 | Batch 050/266 | Loss: 0.6867
Epoch: 003/015 | Batch 100/266 | Loss: 0.6948
Epoch: 003/015 | Batch 150/266 | Loss: 0.6881
Epoch: 003/015 | Batch 200/266 | Loss: 0.6920
Epoch: 003/015 | Batch 250/266 | Loss: 0.6898
training accuracy: 50.28%
va

Save our vocab and model

In [27]:
torch.save(text_vocab, os.path.join(DATA_DIRECTORY, 'text_vocab.pkl'))
torch.save(label_vocab, os.path.join(DATA_DIRECTORY, 'label_vocab.pkl'))

In [28]:
torch.save(model, os.path.join(DATA_DIRECTORY, 'my_model.pkl'))

## Test the Model

Load from file

In [29]:
loaded_text_vocab = torch.load(os.path.join(DATA_DIRECTORY, 'text_vocab.pkl'))
loaded_label_vocab = torch.load(os.path.join(DATA_DIRECTORY, 'label_vocab.pkl'))

loaded_model = torch.load(os.path.join(DATA_DIRECTORY, 'my_model.pkl'), map_location=DEVICE)

Function to produce model predictions from a given sentence

In [30]:
import spacy

nlp = spacy.blank("en")

def predict_sentiment(model, sentence):
    model.eval() # switch model to eval mode

    tokenized = [tok.text for tok in nlp.tokenizer(sentence)] # tokenize text
    indexed = loaded_text_vocab(tokenized) # map tokens to vocabulary indices

    tensor = torch.LongTensor(indexed).to(DEVICE)
    tensor = tensor.unsqueeze(1) # add a batch dimension: [sentence length, 1]

    prediction = torch.nn.functional.softmax(model(tensor), dim=1)
    prediction = prediction[0] # get rid of unnecessary batch dimension

    # get dictionary from labels to probabilities
    prediction_dict = {}
    for i, prob in enumerate(prediction):
        label = loaded_label_vocab.get_itos()[i]
        prediction_dict[label] = prob.item()

    # convert labels to something readable
    label_to_readable = {
        '0': 'negative',
        '1': 'positive',
    }
    prediction_dict = {label_to_readable[k]: v for k,v in prediction_dict.items()}

    return prediction_dict

In [31]:
predict_sentiment(loaded_model, "This is such an awesome movie, I really love it!")

{'positive': 0.9998838901519775, 'negative': 0.000116095005068928}

In [32]:
predict_sentiment(loaded_model, "I really hate this movie. It is really bad and sucks!")

{'positive': 0.08141873776912689, 'negative': 0.9185812473297119}
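`predict_sentiment` turns the two output logits into class probabilities via softmax and then maps the vocabulary indices back to readable labels. A plain-Python sketch of that last step follows; the logits are made up for illustration, and `softmax` and `to_readable` are hypothetical helpers rather than the PyTorch functions used above:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def to_readable(logits, index_to_label, label_to_readable):
    """Pair each class probability with a human-readable label."""
    probs = softmax(logits)
    return {label_to_readable[index_to_label[i]]: p for i, p in enumerate(probs)}

# Made-up logits; the label order mirrors label_vocab above ('1' -> 0, '0' -> 1)
example = to_readable(
    logits=[2.0, -1.0],
    index_to_label=["1", "0"],
    label_to_readable={"0": "negative", "1": "positive"},
)
print(example)
```

Because the label vocabulary assigns index 0 to the label `'1'`, the first probability belongs to the positive class, which is why the mapping through `get_itos()` is needed.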