# RNN and Natural Language Processing

In this exercise we will try to classify reviews from the IMDB dataset: given the text of a review, can you predict whether the review was positive or negative?

Before doing this exercise, you might want to become more familiar with LSTMs by considering the example FlightPassengerPredictions.

The data for this exercise can be found here:
https://sid.erda.dk/share_redirect/encok5nw3y

***

Author: Julius Kirkegaard and Troels C. Petersen<br>
Date: 14th of May 2023

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import torch
import pickle
from collections import defaultdict
from torch import nn
import json
from torch.utils.data import DataLoader
from tqdm import tqdm
from itertools import chain
from collections import Counter

device = "cuda" if torch.cuda.is_available() else "cpu"

## Data

In [2]:
limit_data = 10000  # limit the amount of data for speed, change as you please

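def remove_special_symbols(string):
    # Keep only lowercase ASCII letters (a-z) and spaces.
    # Note: uppercase letters are dropped, not lowercased, so "However" becomes "owever".
    return ''.join(s for s in string if 'a' <= s <= 'z' or s == ' ')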

with open('train.json') as f:
    train_text, train_labels = json.load(f)
    idxs = np.random.permutation(len(train_text))
    train_text = [remove_special_symbols(train_text[i]) for i in idxs[:limit_data]]
    train_labels = [train_labels[i] for i in idxs[:limit_data]]
    
with open('test.json') as f:
    test_text, test_labels = json.load(f)
    idxs = np.random.permutation(len(test_text))   
    test_text = [remove_special_symbols(test_text[i]) for i in idxs[:limit_data]]
    test_labels = [test_labels[i] for i in idxs[:limit_data]]

Let's have a look at the data... Here is a negative review (label = 0)

In [3]:
print(train_text[0])
print(train_labels[0])

es as the other reviewers have already stated this may not be vintage  but its far from being their worst work as at th entury tupid mean ox his film certainly has all of the basic ingredients for things to go wrong for the boys ut its their serious approach and determination that makes them funny hey dont play it for laughs as other comedians might but they take their work and situation quite seriously and that is the essence of their eternal humor n this film they are faced with some basic issues that really might be encountered by any one of us today namely job related stress irst we would get checked out by a doctor and he would prescribe some much needed rest and perhaps staying by the sea hats where the surrealness comes in to all of this  always take a most plausible set of circumstances and exaggerate it but never to the point of being incredible except maybe once in awhile his makes us laugh because we can relate to their self caused predicaments and attempts at extrication ha

And a positive one:

In [4]:
print(train_text[3])
print(train_labels[3])

aniel ay ewis is one of the best actors of our time and one of my favorites t is amazing how much he throws himself in each of the characters he plays making them realbr br  remember many years ago we had a party in our house  the friends came over we were sitting around the table eating drinking the wine talking laughing  having a good time he  was on  there was a movie which we did not pay much attention to hen suddenly all of us stopped talking and laughing he glasses did not clink the forks did not move the food was getting cold on the plates e could not take our eyes off the screen where the young crippled man whose entire body was against him and who only had a control over his left foot picked up a piece of chalk with his foot and for what seemed the eternity tried to write just one word on the floor hen he finished writing that one word we all knew that we had witnessed not one but three triumphs  the triumph of a human will and spirit the triumph of the cinema which was able t

## Embedding

Let's have a look at all the words we have:

In [5]:
all_words = list(chain(*[x.lower().split() for x in train_text]))
print('total number of unique words =', len(set(all_words)))

total number of unique words = 75375


That's a lot of words. Of course, we could clean the data further if we wanted to (for instance, there are probably many misspelled words), but we won't.

In [6]:
Counter(all_words)

Counter({'es': 684,
         'as': 17734,
         'the': 117119,
         'other': 3470,
         'reviewers': 104,
         'have': 11020,
         'already': 523,
         'stated': 52,
         'this': 24254,
         'may': 1221,
         'not': 11062,
         'be': 10681,
         'vintage': 19,
         'but': 13809,
         'its': 7370,
         'far': 1106,
         'from': 8025,
         'being': 2608,
         'their': 4451,
         'worst': 969,
         'work': 1639,
         'at': 8964,
         'th': 318,
         'entury': 52,
         'tupid': 20,
         'mean': 663,
         'ox': 158,
         'his': 17181,
         'film': 15150,
         'certainly': 538,
         'has': 6620,
         'all': 8564,
         'of': 58258,
         'basic': 184,
         'ingredients': 41,
         'for': 16827,
         'things': 1444,
         'to': 54078,
         'go': 1807,
         'wrong': 660,
         'boys': 228,
         'ut': 2983,
         'serious': 394,
         'a

In [7]:
n_words = 25000   # let's make a model that only understands 25000 words

In [8]:
words, count = np.unique(all_words, return_counts=True)
idxs = np.argsort(count)[-n_words:]
vocab = ['<UNK>'] + list(words[idxs][::-1])
print(vocab[:5], '...', vocab[-5:])

['<UNK>', 'the', 'a', 'and', 'of'] ... ['onchata', 'comparative', 'onceiving', 'onceinalifetime', 'omo']


Not very surprisingly, the most commonly used word is _the_. The 25000th most used word is the last entry of `vocab`; it depends on the random subsample, and in the run shown above it is _omo_. We have added a special word `<UNK>` which we will use to mark words outside our vocabulary.

We can now turn a sentence into a sequence of integers that correspond to the position in the vocab.

In [9]:
vocab_d = {vocab[i]: i for i in range(len(vocab))}  # for quick look-up
def sentence_to_integer_sequence(s):
    return torch.tensor([vocab_d[x] if x in vocab_d else 0 for x in s.split()], dtype=torch.long)

In [10]:
sentence_to_integer_sequence("i really liked the movie xenopus51")

tensor([147,  62, 443,   1,  18,   0])
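To make the mapping concrete, here is a minimal sketch of the same idea on a made-up five-word vocabulary. The names `toy_vocab`, `toy_lookup`, `encode` and `decode` are illustrative and not part of the notebook; `dict.get(token, 0)` plays the role of the `if x in vocab_d else 0` fallback.

```python
# Toy vocabulary for illustration only; index 0 is reserved for unknown words.
toy_vocab = ['<UNK>', 'the', 'movie', 'i', 'liked']
toy_lookup = {w: i for i, w in enumerate(toy_vocab)}

def encode(sentence, lookup):
    """Map each whitespace-separated token to its index, or 0 if unknown."""
    return [lookup.get(token, 0) for token in sentence.split()]

def decode(indices, vocab):
    """Map indices back to words; unknown words come back as '<UNK>'."""
    return ' '.join(vocab[i] for i in indices)

ids = encode("i liked the movie xenopus51", toy_lookup)
print(ids)                     # [3, 4, 1, 2, 0]
print(decode(ids, toy_vocab))  # i liked the movie <UNK>
```

Decoding is lossy: every out-of-vocabulary word collapses to the same `<UNK>` token.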

We are now representing words in a "25000"-dimensional space: we have a unique integer for each word we can represent. To reduce this complexity, we will instead represent each word as a 50-dimensional real vector. PyTorch to the rescue:

In [11]:
embedding = nn.Embedding(len(vocab), 50)

`nn.Embedding` assigns a random, trainable vector to each word. For instance:

In [12]:
print(embedding(sentence_to_integer_sequence("movie")))

tensor([[-1.2968,  1.0202,  0.0757, -0.3996, -0.6972, -0.1607,  1.3119, -1.8031,
          1.6506, -2.2972,  0.5593, -1.1413, -0.9477,  0.4262,  1.4848,  1.5991,
         -0.2156, -0.4719,  1.0799,  2.1833, -0.3499, -0.4861, -0.4047, -2.8566,
          1.3584,  0.7612,  0.5755, -0.6354,  1.2501, -0.0080,  0.4063,  1.3415,
          0.9766,  0.4744,  2.0880, -0.5770, -0.0662,  0.2420, -0.3170,  0.5590,
          0.2885,  0.1820, -0.1330, -0.0611,  1.7603,  1.9476,  0.8143,  0.4350,
          0.3666,  1.4955]], grad_fn=<EmbeddingBackward>)
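One way to read the embedding layer is as a lookup table that is equivalent to multiplying a one-hot vector by a weight matrix. The sketch below checks this on a toy layer (6 words, 3 dimensions); the sizes and the name `toy_embedding` are chosen only for illustration.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
toy_embedding = nn.Embedding(6, 3)   # 6-word toy vocabulary, 3-dimensional vectors
idx = torch.tensor([4, 1])           # two word indices

one_hot = F.one_hot(idx, num_classes=6).float()   # shape (2, 6)
via_matmul = one_hot @ toy_embedding.weight        # shape (2, 3)
via_lookup = toy_embedding(idx)                    # shape (2, 3)

print(torch.allclose(via_matmul, via_lookup))      # True
```

The lookup avoids ever building the large one-hot vectors, which is why it is preferred in practice.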


The special `<UNK>` word, which signifies an unknown word, we can choose to zero out:

In [13]:
embedding.weight.data[0, :] = 0

Here is an example of how we represent a sentence then:

In [14]:
print(embedding(sentence_to_integer_sequence("i really liked the movie xenopus51")).shape)
print(embedding(sentence_to_integer_sequence("i really liked the movie xenopus51")))

torch.Size([6, 50])
tensor([[ 5.1276e-02,  6.7472e-01,  2.2765e-01, -5.6824e-01, -2.7173e+00,
          1.9144e+00,  1.6569e-01,  1.1173e+00, -4.9625e-01,  1.0714e+00,
          4.4696e-01, -5.3708e-01, -1.2415e+00, -5.1944e-02, -8.7727e-01,
          3.1294e-01,  1.5166e+00, -7.9451e-01, -7.4870e-01, -3.9630e-01,
          5.0324e-01, -4.5160e-01, -6.0607e-01, -7.7758e-01, -1.1998e+00,
          9.9511e-01, -1.5146e+00, -3.0451e-01,  2.7203e+00, -2.2697e+00,
         -6.1929e-02, -2.1723e-01,  1.4525e+00,  8.1341e-01,  4.1871e-02,
          6.2729e-01, -8.6162e-01, -7.1337e-02,  7.6250e-01, -2.3121e+00,
          2.0560e+00, -8.0771e-01,  6.5941e-01, -3.5106e-01, -7.1696e-02,
         -1.8593e+00,  1.4441e+00, -2.5056e+00, -9.8652e-01,  5.7731e-01],
        [ 2.7958e-01, -8.8597e-02,  1.4613e+00,  1.8589e-01, -1.7020e-01,
         -2.8023e-01,  2.9275e-01,  4.4134e-01, -5.3426e-02, -2.1714e+00,
          1.7586e+00,  1.7863e+00, -8.9244e-01, -4.0376e-01, -1.0448e+00,
         -2.7645e

In this way, the sentence basically becomes a 6x50 pixel image.

You can now use `nn.LSTM` after an embedding to define a neural network for sentences.

This network will have a _lot_ of parameters. For each word, 50 parameters need to be trained, and then comes the LSTM on top of that.
This is the reason that _transfer learning_ is so important in natural language processing (NLP).
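To get a feel for the numbers, the following sketch counts the trainable parameters of an embedding of the size used above and of a small single-layer LSTM. The hidden size of 30 and the helper `n_parameters` are assumptions made for illustration.

```python
from torch import nn

def n_parameters(module):
    """Total number of scalar parameters in a module."""
    return sum(p.numel() for p in module.parameters())

emb = nn.Embedding(25001, 50)   # 25000 words plus <UNK>
lstm = nn.LSTM(50, 30)          # hidden size 30 is an arbitrary choice here

print(n_parameters(emb))    # 25001 * 50 = 1250050
print(n_parameters(lstm))   # 4 * (30*50 + 30*30 + 30 + 30) = 9840
```

With these sizes the embedding holds more than a hundred times as many parameters as the LSTM, which is why reusing a pretrained embedding saves so much training effort.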

Perhaps the simplest form of transfer learning is to use a pretrained embedding layer.

In [15]:
with open('glove.6B.50d.pkl', 'rb') as f:
    glove = pickle.load(f)

In [16]:
glove['movie']

array([ 0.30824 ,  0.17223 , -0.23339 ,  0.023105,  0.28522 ,  0.23076 ,
       -0.41048 , -1.0035  , -0.2072  ,  1.4327  , -0.80684 ,  0.68954 ,
       -0.43648 ,  1.1069  ,  1.6107  , -0.31966 ,  0.47744 ,  0.79395 ,
       -0.84374 ,  0.064509,  0.90251 ,  0.78609 ,  0.29699 ,  0.76057 ,
        0.433   , -1.5032  , -1.6423  ,  0.30256 ,  0.30771 , -0.87057 ,
        2.4782  , -0.025852,  0.5013  , -0.38593 , -0.15633 ,  0.45522 ,
        0.04901 , -0.42599 , -0.86402 , -1.3076  , -0.29576 ,  1.209   ,
       -0.3127  , -0.72462 , -0.80801 ,  0.082667,  0.26738 , -0.98177 ,
       -0.32147 ,  0.99823 ])

These pretrained word embeddings will have a good structure to them. For instance:

In [17]:
print('Distance b/w queen and prince =', np.linalg.norm(glove['queen'] - glove['prince']))
print('Distance b/w movie and prince =', np.linalg.norm(glove['movie'] - glove['prince']))

Distance b/w queen and prince = 3.3926491063677284
Distance b/w movie and prince = 6.450784821820618


You can even sometimes get away with doing algebra with these vectors:

In [18]:
queenlike = glove['king'] - glove['man'] + glove['woman']

In [19]:
print('Distance b/w queen and algebraic queen =', np.linalg.norm(glove['queen'] - queenlike))
print('Distance b/w queen and king =', np.linalg.norm(glove['queen'] - glove['king']))

Distance b/w queen and algebraic queen = 2.8391206432941996
Distance b/w queen and king = 3.4777562289742345
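The same kind of algebra can be turned into a nearest-neighbour search. The sketch below uses hand-made 2-dimensional vectors (not real GloVe values) so that it runs without the pickle file; `toy` and `nearest` are hypothetical names.

```python
import numpy as np

# Hypothetical 2-d vectors: axis 0 ~ "royalty", axis 1 ~ "gender". Not real GloVe values.
toy = {
    'king':  np.array([1.0,  1.0]),
    'queen': np.array([1.0, -1.0]),
    'man':   np.array([0.0,  1.0]),
    'woman': np.array([0.0, -1.0]),
    'movie': np.array([-1.0, 0.2]),
}

def nearest(query, vectors, exclude=()):
    """Return the word whose vector has the smallest Euclidean distance to `query`."""
    candidates = [w for w in vectors if w not in exclude]
    return min(candidates, key=lambda w: np.linalg.norm(vectors[w] - query))

toy_queenlike = toy['king'] - toy['man'] + toy['woman']
print(nearest(toy_queenlike, toy, exclude=('king', 'man', 'woman')))  # queen
```

Assuming `glove` behaves like a dictionary of NumPy arrays, as the cells above suggest, `nearest(queenlike, glove, exclude=('king', 'man', 'woman'))` should work the same way, although a plain Python loop over the whole vocabulary will be slow.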


We can fill out our embedding layer using these pretrained vectors:

In [20]:
filled = 0
not_found = []
for i, w in enumerate(vocab):
    if w in glove:
        embedding.weight.data[i, :] = torch.tensor(glove[w], dtype=torch.float)
        filled += 1
    else:
        not_found.append(w)
print(f'Of the words in the vocab, {100 * filled / (len(vocab) - 1)} % were updated using Glove vectors')
print('Examples of words not found :', not_found[1:7])

Of the words in the vocab, 73.376 % were updated using Glove vectors
Examples of words not found : ['owever', 'ichael', 'itbr', 'nfortunately', 'nglish', 'ritish']


Clearly the words we do not find are misspelled, in many cases because `remove_special_symbols` drops uppercase letters rather than lowercasing them (so "However" becomes "owever"). You now have three choices before you continue with the exercise:

(1) Use _hunspell_ or similar to fix misspelled words

(2) Change the vocabulary (`vocab`) to be based on words that are both frequent in the text and have GloVe vectors

(3) Ignore the issue.

Finally, with GloVe vectors we do not need to train the embedding. We can consider it fixed:

In [21]:
embedding.requires_grad_(False)

Embedding(25001, 50)
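As an alternative to filling the weights by hand and then freezing them, PyTorch also offers `nn.Embedding.from_pretrained`, which builds a frozen layer directly from a weight matrix. The sketch below uses a small random matrix as a stand-in for the real GloVe weights.

```python
import torch
from torch import nn

# Hypothetical pretrained matrix: one row per vocabulary entry, row 0 for <UNK>.
weights = torch.randn(5, 50)
weights[0] = 0.0

frozen = nn.Embedding.from_pretrained(weights, freeze=True)
print(frozen.weight.requires_grad)          # False
print(frozen(torch.tensor([0, 3])).shape)   # torch.Size([2, 50])
```

Passing `freeze=False` instead would let the vectors be fine-tuned along with the rest of the network.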

### Exercise 1

Train a neural network using `nn.LSTM` layers to classify the IMDB reviews.

Ideas:

 - The sentences can be quite long, so you might want to limit them to, say, a maximum of 100 words.
 - `nn.LSTM` can take batched input, but be careful with `batch_first=True/False`.
 - If batched inputs are used, they normally have to be equal in size. You can make every sentence exactly 100 words long by adding `<UNK>` words to short sentences.
 - Alternatively, `nn.LSTM` can also accept batched, variable-length input using `torch.nn.utils.rnn.pack_sequence`.
 - A final, albeit slow, alternative is to simply run the LSTM on un-batched input.
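The difference between padding and packing can be sketched on two toy sequences. All sizes below are made up for illustration and are not those of the exercise; the point is that with packing, the final hidden state belongs to the last real word rather than to the padding.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pad_sequence, pack_sequence

torch.manual_seed(0)
emb = nn.Embedding(10, 4)                # toy sizes
lstm = nn.LSTM(4, 3, batch_first=True)

seqs = [torch.tensor([5, 2, 7]), torch.tensor([1, 4, 4, 8, 3])]   # lengths 3 and 5

# Option 1: pad with index 0 (<UNK>) to a common length.
padded = pad_sequence(seqs, batch_first=True)     # shape (2, 5)
out_padded, _ = lstm(emb(padded))                  # shape (2, 5, 3)

# Option 2: pack the embedded sequences; the LSTM then never sees padding.
packed = pack_sequence([emb(s) for s in seqs], enforce_sorted=False)
_, (h_n, _) = lstm(packed)                         # h_n has shape (1, 2, 3)

# For the short sequence, the packed final state matches the padded output
# at its last real token (position 2), not at the last padded position.
print(torch.allclose(h_n[0, 0], out_padded[0, 2], atol=1e-5))
```

With padding, reading `out_padded[:, -1]` for the short sequence would include two extra `<UNK>` steps, which may or may not matter for the final accuracy.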


In [22]:
words, count = np.unique(all_words, return_counts=True)
words = words[np.argsort(count)[::-1]]
vocab = ['<UNK>']
for w in words:
    if w in glove:
        vocab.append(w)    
    if len(vocab) == n_words:
        break

print(vocab[:10])
print(vocab[-10:])

['<UNK>', 'the', 'a', 'and', 'of', 'to', 'is', 'in', 'he', 'that']
['chessboard', 'chested', 'overstates', 'migraine', 'chow', 'minuets', 'righthand', 'migraines', 'charted', 'mechanisms']


In [23]:
embedding = nn.Embedding(len(vocab), 50)
embedding.requires_grad_(False)
for i, w in enumerate(vocab):
    if w in glove:
        embedding.weight[i, :] = torch.tensor(glove[w], dtype=embedding.weight.dtype)
embedding.weight[0, :] = 0

In [24]:
vocab_d = {vocab[i]: i for i in range(len(vocab))}  # for quick look-up
max_seq_len = 100

def sentence_to_integer_sequence(s):
    return torch.tensor([vocab_d[x] if x in vocab_d else 0 for x in s.split()][:max_seq_len], dtype=torch.long)


train_seq = [sentence_to_integer_sequence(x) for x in train_text]
test_seq = [sentence_to_integer_sequence(x) for x in test_text]

In [25]:
class IMDBNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(50, 30, num_layers=2, batch_first=True, bidirectional=True)
        self.linear = nn.Linear(2 * 30, 1)
        self.sigmoid = nn.Sigmoid()
        
    def forward(self, x):
        x = embedding(x)
        output, _ = self.lstm(x)
        final_hidden = output[:, -1, :]
        return self.sigmoid(self.linear(final_hidden)[:, 0])
    
net = IMDBNet()
embedding.to(device)
net.to(device)

IMDBNet(
  (lstm): LSTM(50, 30, num_layers=2, batch_first=True, bidirectional=True)
  (linear): Linear(in_features=60, out_features=1, bias=True)
  (sigmoid): Sigmoid()
)
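One subtlety worth checking with a bidirectional LSTM: `output[:, -1, :]` pairs the forward state after the whole sequence with the backward state at the last time step, where the backward direction has only seen one token. A possible alternative, sketched below on random stand-in input, is to build the summary from `h_n`, which holds the final state of each direction.

```python
import torch
from torch import nn

torch.manual_seed(0)
H = 30
lstm = nn.LSTM(50, H, num_layers=2, batch_first=True, bidirectional=True)
x = torch.randn(3, 7, 50)          # (batch, time, features), random stand-in input

output, (h_n, _) = lstm(x)         # output: (3, 7, 2*H), h_n: (num_layers*2, 3, H)

forward_final = h_n[-2]            # last layer, forward direction
backward_final = h_n[-1]           # last layer, backward direction

# The forward final state sits at the last time step of `output`,
# the backward final state at the first time step.
print(torch.allclose(forward_final, output[:, -1, :H], atol=1e-6))
print(torch.allclose(backward_final, output[:, 0, H:], atol=1e-6))

summary = torch.cat([forward_final, backward_final], dim=1)   # shape (3, 2*H)
```

`summary` has the same shape as `output[:, -1, :]`, so it could be fed to the same `nn.Linear(2 * 30, 1)` layer. Whether it improves the classifier here is untested.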

In [26]:
from torch.nn.utils.rnn import pad_sequence
train = pad_sequence(train_seq, batch_first=True)
test = pad_sequence(test_seq, batch_first=True)

In [27]:
from torch.utils.data import DataLoader
train_loader = DataLoader(list(zip(train, train_labels)), batch_size=128)
test_loader = DataLoader(list(zip(test, test_labels)), batch_size=128)

In [28]:
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

loss_criterion = nn.BCELoss()
net.to(device)
for epoch in range(100):
    acc_loss = 0.0
    for seq, label in tqdm(train_loader):
        seq = seq.to(device)
        label = label.to(device).float()
        
        output = net(seq)
        loss = loss_criterion(output, label)
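        loss.backward()  # calculate gradients: d Loss / d Parameters
        opt.step()  # take a step down-hill
        opt.zero_grad()  # zero the gradient calculations for next iteration
        acc_loss += float(loss)
    # Note: this divides the sum of per-batch mean losses by the number of samples,
    # so the printed value is the mean batch loss scaled by (n_batches / n_samples).
    acc_loss /= len(train)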
    print(f'Epoch {epoch + 1} loss = {acc_loss}')

Epoch 1 loss = 0.005389249348640442
Epoch 2 loss = 0.00536091159582138
Epoch 3 loss = 0.005470919674634933
Epoch 4 loss = 0.005358284306526184
Epoch 5 loss = 0.005359417927265167
Epoch 6 loss = 0.005450261968374252
Epoch 7 loss = 0.005420373630523682
Epoch 8 loss = 0.0053317811846733095
Epoch 9 loss = 0.005136041814088822
Epoch 10 loss = 0.005524197727441788
Epoch 11 loss = 0.005439566004276275
Epoch 12 loss = 0.005424370574951172
Epoch 13 loss = 0.005392865210771561
Epoch 14 loss = 0.005347777825593948
Epoch 15 loss = 0.005252210301160813
Epoch 16 loss = 0.005129795449972153
Epoch 17 loss = 0.004891270017623901
Epoch 18 loss = 0.0045809281706809995
Epoch 19 loss = 0.004361371365189552
Epoch 20 loss = 0.0042273925215005875
Epoch 21 loss = 0.004133732342720031
Epoch 22 loss = 0.0039800951093435285
Epoch 23 loss = 0.0038960348933935163
Epoch 24 loss = 0.0038344419807195664
Epoch 25 loss = 0.003780973729491234
Epoch 26 loss = 0.003724361830949783
Epoch 27 loss = 0.0036750845313072203
Epoch 28 loss = 0.00363691782951355
Epoch 29 loss = 0.0035966800928115845
Epoch 30 loss = 0.00357138038277626
Epoch 31 loss = 0.003564124596118927
Epoch 32 loss = 0.0035082209646701812
Epoch 33 loss = 0.0034760828703641893
Epoch 34 loss = 0.0034585849821567537
Epoch 35 loss = 0.0034003431648015974
Epoch 36 loss = 0.00334203542470932
Epoch 37 loss = 0.0033410898715257645
Epoch 38 loss = 0.0033122685760259628
Epoch 39 loss = 0.003295270308852196
Epoch 40 loss = 0.003252117520570755
Epoch 41 loss = 0.0031981156349182127
Epoch 42 loss = 0.003125904953479767
Epoch 43 loss = 0.003089902475476265
Epoch 44 loss = 0.003055961447954178
Epoch 45 loss = 0.0029714376404881477
Epoch 46 loss = 0.002908417809009552
Epoch 47 loss = 0.0028618209347128867
Epoch 48 loss = 0.002860668797045946
Epoch 49 loss = 0.0027574351474642755
Epoch 50 loss = 0.0027101651564240456
Epoch 51 loss = 0.0027009857654571533
Epoch 52 loss = 0.0026113946571946142
Epoch 53 loss = 0.0025645316798239945
Epoch 54 loss = 0.002578785737603903
Epoch 55 loss = 0.0024693561844527723
Epoch 56 loss = 0.002431293272972107
Epoch 57 loss = 0.002613813494890928
Epoch 58 loss = 0.002466647853702307
Epoch 59 loss = 0.0024741932429373265
Epoch 60 loss = 0.002401602405309677
Epoch 61 loss = 0.002351766015961766
Epoch 62 loss = 0.0022194566927850245
Epoch 63 loss = 0.0021562471702694895
Epoch 64 loss = 0.0020958400301635265
Epoch 65 loss = 0.002021418996155262
Epoch 66 loss = 0.0020134456373751162
Epoch 67 loss = 0.0018933440232649446
Epoch 68 loss = 0.00184073895085603
Epoch 69 loss = 0.0017879243981093168
Epoch 70 loss = 0.0018194562505930662
Epoch 71 loss = 0.001822534103691578
Epoch 72 loss = 0.0017764240853488445
Epoch 73 loss = 0.0018119452185928822
Epoch 74 loss = 0.001985166067443788
Epoch 75 loss = 0.001808354996331036
Epoch 76 loss = 0.0017587558280676603
Epoch 77 loss = 0.0015813358705490828
Epoch 78 loss = 0.0015697043407708407
Epoch 79 loss = 0.0016087977522984147
Epoch 80 loss = 0.0015434708714485168
Epoch 81 loss = 0.0014455534882843494
Epoch 82 loss = 0.0013976826804690064
Epoch 83 loss = 0.0014329652297310532
Epoch 84 loss = 0.0016088279194198549
Epoch 85 loss = 0.001416977541334927
Epoch 86 loss = 0.0016288387274369597
Epoch 87 loss = 0.001679326254595071
Epoch 88 loss = 0.0012500101488083601
Epoch 89 loss = 0.0011477841725572944
Epoch 90 loss = 0.0011430880770552903
Epoch 91 loss = 0.0011508140650577843
Epoch 92 loss = 0.0011033616974018515
Epoch 93 loss = 0.0010050127591006458
Epoch 94 loss = 0.001077073425007984
Epoch 95 loss = 0.001139446108788252
Epoch 96 loss = 0.0011172955614514649
Epoch 97 loss = 0.0010727595877833664
Epoch 98 loss = 0.0010368263479787856
Epoch 99 loss = 0.001125620032986626
Epoch 100 loss = 0.001214676957949996

In [29]:
acc = 0
count = 0
for seq, label in tqdm(train_loader):
    seq = seq.to(device)
    label = label.to(device)
    output = net(seq)
    acc += torch.sum(torch.eq(torch.round(output), label))
    count += len(label)
acc = acc / count
print('Train accuracy =', acc)

acc = 0
count = 0
for seq, label in tqdm(test_loader):
    seq = seq.to(device)
    label = label.to(device)
    output = net(seq)
    acc += torch.sum(torch.eq(torch.round(output), label))
    count += len(label)
acc = acc / count
print('Test accuracy =', acc)

Train accuracy = tensor(0.9623)
Test accuracy = tensor(0.7646)

In [30]:
embedding.cpu()
net.cpu()

IMDBNet(
  (lstm): LSTM(50, 30, num_layers=2, batch_first=True, bidirectional=True)
  (linear): Linear(in_features=60, out_features=1, bias=True)
  (sigmoid): Sigmoid()
)

In [31]:
float(net(sentence_to_integer_sequence("fargo is a great comedy")[None, :]))

0.9819666147232056

In [32]:
float(net(sentence_to_integer_sequence("worst movie i have ever seen")[None, :]))

0.22874943912029266

In [33]:
float(net(sentence_to_integer_sequence("best movie i have ever seen")[None, :]))

0.8959909081459045

In [34]:
float(net(sentence_to_integer_sequence("i don't really know how i feel about this movie")[None, :]))

0.540596604347229

In [35]:
float(net(sentence_to_integer_sequence("at first i thought this would be terrible, but then it turned out to be really nice")[None, :]))

0.82257479429245

In [36]:
float(net(sentence_to_integer_sequence("at first i thought this would be good, but then it turned out to be really terrible")[None, :]))

0.20609375834465027
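Note that the vocabulary was built from cleaned text containing only the letters a-z and spaces, so tokens such as `terrible,` or `don't` in the sentences above are mapped to `<UNK>`. A minimal sketch of cleaning a sentence before encoding it is shown below. The helper `clean` lowercases first, which is a small deviation from `remove_special_symbols`, and the toy vocabulary is hypothetical.

```python
def clean(text):
    """Lowercase, then keep only a-z and spaces (punctuation is dropped)."""
    return ''.join(c for c in text.lower() if 'a' <= c <= 'z' or c == ' ')

toy_lookup = {'<UNK>': 0, 'terrible': 1, 'but': 2, 'dont': 3}   # hypothetical vocabulary

def encode(text, lookup):
    """Map each whitespace-separated token to its index, or 0 if unknown."""
    return [lookup.get(tok, 0) for tok in text.split()]

raw = "terrible, but i don't"
print(encode(raw, toy_lookup))          # [0, 2, 0, 0]
print(encode(clean(raw), toy_lookup))   # [1, 2, 0, 3]
```

Applying the same cleaning at prediction time as at training time lets the network actually see the sentiment-carrying words.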

## Language model

Often in NLP, a lot of unlabelled text is available. We can use this to pretrain the model before fine-tuning it on the task at hand.

In [37]:
with open('unlabelled.json') as f:
    text = json.load(f)
print(len(text))

50000


A simple way to pretrain is to train a _language model_. This is a model that tries to predict the next word in a sentence. For instance, given:

In [38]:
" ".join(text[2].split()[:28])

'This is still a pretty bad and silly simplistic typical slasher but when being compared to the previous sequel "Slumber Party Massacre II" this movie is a step'

can you guess the next word?

In a language model, we consider the above the input, and the output is:

In [39]:
text[2].split()[28]

'up'

The input is encoded using the embedding, while the output is a probability distribution over the words in the vocabulary.
In other words, the last layer is something like `nn.Linear(..., len(vocab))`.
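A minimal sketch of such a model is given below, under the assumption of a single-layer LSTM and toy sizes. `ToyLanguageModel` is a hypothetical name, and random indices stand in for encoded text. The network outputs logits; `nn.CrossEntropyLoss` applies the softmax internally, so the probabilities are never formed explicitly during training.

```python
import torch
from torch import nn

class ToyLanguageModel(nn.Module):
    """Embedding -> LSTM -> linear layer with one logit per vocabulary word."""
    def __init__(self, vocab_size, embed_dim=50, hidden_dim=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.linear = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):                       # x: (batch, time) word indices
        hidden, _ = self.lstm(self.embedding(x))
        return self.linear(hidden)              # (batch, time, vocab_size) logits

vocab_size = 100                                # toy value; the exercise would use len(vocab)
model = ToyLanguageModel(vocab_size)
tokens = torch.randint(0, vocab_size, (8, 21))  # random stand-in for encoded text

inputs, targets = tokens[:, :-1], tokens[:, 1:]   # predict word t+1 from words up to t
logits = model(inputs)
loss = nn.CrossEntropyLoss()(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()
print(logits.shape, float(loss))
```

Shifting the sequence by one position gives a training target at every time step, so each sentence provides many examples.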

### Exercise 2 (optional)

Train a language model.

After you have trained the model, try to make it complete sentences.
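Completion can be done by repeatedly feeding the text so far through the model and appending the most likely next word (greedy decoding). The sketch below uses small untrained layers and a toy vocabulary, so the generated words are arbitrary; all names are hypothetical.

```python
import torch
from torch import nn

torch.manual_seed(0)
toy_vocab = ['<UNK>', 'the', 'movie', 'was', 'great']   # toy vocabulary
toy_embedding = nn.Embedding(len(toy_vocab), 8)
toy_lstm = nn.LSTM(8, 16, batch_first=True)
toy_head = nn.Linear(16, len(toy_vocab))

@torch.no_grad()
def complete(prompt_ids, n_new):
    """Greedily append `n_new` word indices to the prompt."""
    ids = list(prompt_ids)
    for _ in range(n_new):
        x = torch.tensor(ids)[None, :]               # shape (1, time)
        hidden, _ = toy_lstm(toy_embedding(x))
        logits = toy_head(hidden[0, -1])             # logits for the next word
        ids.append(int(torch.argmax(logits)))        # greedy choice
    return ids

out = complete([1, 2], n_new=3)
print(' '.join(toy_vocab[i] for i in out))
```

Sampling from `torch.softmax(logits, dim=0)` instead of taking the `argmax` usually gives more varied text.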

## Transfer learning
### Exercise 3 (optional)

Discard the last layer of the now-trained language model and use the remaining layers as the starting point for training on the original IMDB problem.
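One possible way to set this up is sketched below, assuming the language model and the classifier share the same embedding and LSTM shapes. The class names are hypothetical, and an untrained `LanguageModel` stands in for the trained one.

```python
import torch
from torch import nn

class LanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=50, hidden_dim=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.linear = nn.Linear(hidden_dim, vocab_size)   # next-word head

class Classifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=50, hidden_dim=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.linear = nn.Linear(hidden_dim, 1)            # new sentiment head

    def forward(self, x):
        hidden, _ = self.lstm(self.embedding(x))
        return torch.sigmoid(self.linear(hidden[:, -1])[:, 0])

lm = LanguageModel(vocab_size=100)    # stands in for an already trained model
clf = Classifier(vocab_size=100)

# Copy the embedding and LSTM weights; the old output layer is discarded.
clf.embedding.load_state_dict(lm.embedding.state_dict())
clf.lstm.load_state_dict(lm.lstm.state_dict())

print(clf(torch.randint(0, 100, (4, 12))).shape)   # torch.Size([4])
```

Only the new `linear` head starts from random weights. Whether to freeze the copied layers or fine-tune them with a small learning rate is a choice to experiment with.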