# RNN and Natural Language Processing

In this exercise we will try to classify reviews from the IMDB dataset: given the text of a review, can you predict whether the review is positive or negative?

Before doing this exercise, you might want to become more familiar with LSTMs by working through the example FlightPassengerPredictions.

The data for this exercise can be found here:
https://sid.erda.dk/share_redirect/encok5nw3y

***

Author: Julius Kirkegaard<br>
Date: 15th of May 2022

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import torch
import pickle
from collections import defaultdict
from torch import nn
import json
from torch.utils.data import DataLoader
from tqdm import tqdm
from itertools import chain
from collections import Counter

device = "cuda" if torch.cuda.is_available() else "cpu"

## Data

In [2]:
limit_data = 10000  # limit the amount of data for speed, change as you please

def remove_special_symbols(string):
    # Keep only lowercase a-z and spaces. Note: uppercase letters are dropped, not lowercased.
    return ''.join(s for s in string if ('a' <= s <= 'z') or s == ' ')

with open('train.json') as f:
    train_text, train_labels = json.load(f)
    idxs = np.random.permutation(len(train_text))
    train_text = [remove_special_symbols(train_text[i]) for i in idxs[:limit_data]]
    train_labels = [train_labels[i] for i in idxs[:limit_data]]
    
with open('test.json') as f:
    test_text, test_labels = json.load(f)
    idxs = np.random.permutation(len(test_text))   
    test_text = [remove_special_symbols(test_text[i]) for i in idxs[:limit_data]]
    test_labels = [test_labels[i] for i in idxs[:limit_data]]

Let's have a look at the data. Capital letters are missing because `remove_special_symbols` drops them. Here is a negative review (label = 0):

In [3]:
print(train_text[0])
print(train_labels[0])

nother merican ie movie has been shoved down our throats and this one is the worst one of them all t doesnt deserve the name merican ie hey should have stopped at he eddingbr br his movie feels like just a stupid porn movie which they slapped the title merican ie on hen i was watching this i felt like i was watching a different series t doesnt fell like merican ie at all t has different humor and it is much more rude and has many more sex scenes then the other merican ie moviesbr br  dont recommend it ever ctually i dont recommend any of the merican ie resents movies ust stick with the nice original trilogybr br 
0


And a positive one:

In [4]:
print(train_text[3])
print(train_labels[3])

ot illions is a delightful comedy that is made even better by the presence of the marvelous cast assembled for it he movie is a tribute to the genius of eter stinov who wrote the screen play and appears as the key figure of an enterprising embezzler he movie directed by ric ill doesnt show signs of having dated as terribly as some others from that periodbr br t the center of the action is a friendly man arcus endleton who before being released from prison fixes the income tax forms for the warden who is amazed of the refund he is owed by the government arcus who is a genius at numbers sees opportunities where others wouldnt e starts working for a firm that uses the latest computer for its accounting but endleton is a resourceful man who finds a way to take advantage of the system and establishes different phony accounts in different parts of the continentbr br arcus is assigned a secretary who also happens to have a flat in his building he inept atty is seen working as a bus fare taker

## Embedding

Let's have a look at all the words we have:

In [5]:
all_words = list(chain(*[x.lower().split() for x in train_text]))
print('total number of unique words =', len(set(all_words)))

total number of unique words = 74841


That's a lot of words. Of course, we could clean the data further if we wanted to (for instance, there are probably many misspelled words), but we won't.

In [6]:
Counter(all_words)

Counter({'nother': 247,
         'merican': 786,
         'ie': 175,
         'movie': 16495,
         'has': 6648,
         'been': 3698,
         'shoved': 14,
         'down': 1352,
         'our': 1064,
         'throats': 14,
         'and': 62529,
         'this': 23928,
         'one': 9335,
         'is': 43039,
         'the': 115592,
         'worst': 1039,
         'of': 57591,
         'them': 3110,
         'all': 8500,
         't': 5489,
         'doesnt': 1803,
         'deserve': 117,
         'name': 598,
         'hey': 1273,
         'should': 1950,
         'have': 11141,
         'stopped': 86,
         'at': 8870,
         'he': 27689,
         'eddingbr': 1,
         'br': 23132,
         'his': 17280,
         'feels': 347,
         'like': 7647,
         'just': 6670,
         'a': 63015,
         'stupid': 621,
         'porn': 148,
         'which': 4584,
         'they': 7198,
         'slapped': 21,
         'title': 576,
         'on': 13316,
         ...

In [7]:
n_words = 25000   # let's make a model that only understand 25000 words

In [8]:
words, count = np.unique(all_words, return_counts=True)
idxs = np.argsort(count)[-n_words:]
vocab = ['<UNK>'] + list(words[idxs][::-1])
print(vocab[:5], '...', vocab[-5:])

['<UNK>', 'the', 'a', 'and', 'of'] ... ['superego', 'umbles', 'harried', 'bluffs', 'slicker']


Not very surprisingly, the most commonly used word is _the_. The 25000th most used word depends on the random subset of reviews that was loaded; in the run shown above it is _slicker_. We have added a special word `<UNK>`, which we will use to mark words outside our vocabulary.

We can now turn a sentence into a sequence of integers, each giving the position of a word in `vocab`.

In [9]:
vocab_d = {vocab[i]: i for i in range(len(vocab))}  # for quick look-up
def sentence_to_integer_sequence(s):
    return torch.tensor([vocab_d[x] if x in vocab_d else 0 for x in s.split()], dtype=torch.long)

In [10]:
sentence_to_integer_sequence("i really liked the movie xenopus51")

tensor([140,  58, 366,   1,  18,   0])

We now represent each word by a unique integer, which effectively places the words in a 25000-dimensional (one-hot) space. To reduce this complexity, we instead represent each word as a 50-dimensional real vector. PyTorch to the rescue:

In [11]:
embedding = nn.Embedding(len(vocab), 50)

`nn.Embedding` assigns a random, trainable vector to each word. For instance:

In [12]:
print(embedding(sentence_to_integer_sequence("movie")))

tensor([[ 1.5061, -2.8431,  0.1609, -0.0623, -0.6275, -0.0611, -0.6655, -0.5654,
          2.1624,  0.0422,  0.5315, -2.7759,  0.1698,  2.3112, -0.8031,  0.0342,
          0.8473,  0.1359, -0.6043, -0.0755, -0.2010,  0.4843, -1.3758, -0.6464,
          0.8647, -0.4374,  1.3393, -1.2365,  0.6589, -1.4002,  0.9509, -1.3786,
         -0.7353, -0.2399,  0.0047, -0.3996,  1.4220,  0.5730,  1.0574, -0.0103,
          1.7864,  1.3318, -0.7880,  0.0426, -0.4476, -0.1906,  1.9406,  0.3124,
          3.5870, -1.0896]], grad_fn=<EmbeddingBackward0>)


The special `<UNK>` word, which signifies an unknown word, we can choose to zero out:

In [13]:
embedding.weight.data[0, :] = 0
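
As an alternative to zeroing the row by hand, `nn.Embedding` accepts a `padding_idx` argument. The sketch below is not part of the original notebook; the sizes `toy_vocab_size` and `toy_dim` are made up to keep it small. With `padding_idx=0`, row 0 starts out as zeros and receives a zero gradient, so it should stay zero during training.

```python
import torch
from torch import nn

# Hypothetical small sizes, chosen only to keep the example readable.
toy_vocab_size, toy_dim = 10, 4

# padding_idx=0 initialises row 0 to zeros and excludes it from gradient updates.
toy_embedding = nn.Embedding(toy_vocab_size, toy_dim, padding_idx=0)

ids = torch.tensor([0, 3, 0, 7])   # index 0 plays the role of <UNK>
vectors = toy_embedding(ids)
vectors.sum().backward()

unk_row_is_zero = bool(torch.all(toy_embedding.weight[0] == 0))
unk_grad_is_zero = bool(torch.all(toy_embedding.weight.grad[0] == 0))
print(unk_row_is_zero, unk_grad_is_zero)
```

Manually zeroing the row, as done above in the notebook, does not by itself stop the row from being updated if the embedding is trained later.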

Here is an example of how we represent a sentence then:

In [14]:
print(embedding(sentence_to_integer_sequence("i really liked the movie xenopus51")).shape)
print(embedding(sentence_to_integer_sequence("i really liked the movie xenopus51")))

torch.Size([6, 50])
tensor([[ 1.2879e+00,  5.6446e-01, -1.2645e+00,  1.5018e+00, -5.6765e-01,
         -6.5932e-01,  1.8361e+00,  7.8959e-01,  2.7334e+00,  2.6503e-01,
         -1.0897e+00,  6.7540e-01, -4.5790e-02,  7.7984e-02,  9.9163e-01,
         -1.9653e-01, -6.6368e-01,  1.2701e+00, -2.0874e+00, -8.4387e-02,
          2.1355e+00, -1.1942e+00,  1.7217e+00, -1.6965e+00,  9.2238e-01,
          6.8587e-01, -1.3995e+00, -3.3029e-02,  3.0750e-01, -1.4805e+00,
          7.1878e-01, -4.4912e-01, -1.6127e+00, -2.4040e+00,  8.1668e-01,
         -4.2567e-01, -1.7806e-01,  9.2791e-02,  4.2006e-01, -2.1876e-01,
         -1.5945e+00,  2.0227e-01,  8.2907e-01, -1.6544e-01,  1.5224e+00,
          7.1911e-01,  9.5552e-01, -1.8039e+00, -1.4575e+00,  1.6158e-01],
        [ 9.9861e-01, -1.0898e+00,  1.2775e-01, -3.9753e-01, -5.4287e-01,
          6.4440e-01,  1.4922e+00, -2.5181e-01,  1.3588e+00, -1.2189e-01,
         -1.6739e-01, -2.9228e+00,  1.4605e-01,  6.2409e-01, -1.0214e-01,
         ...

In this way, the sentence basically becomes a 6x50 pixel image.

You can now use `nn.LSTM` after an embedding to define a neural network for sentences.

This network will have a _lot_ of parameters. For each word, 50 parameters need to be trained, and then comes the LSTM on top of that.
This is the reason that _transfer learning_ is so important in natural language processing (NLP).
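
To get a feeling for the numbers, the parameters can simply be counted. This is a rough sketch: `hidden_size = 64` is an assumption, since the notebook does not fix a hidden size, and `count_parameters` is a helper introduced here for illustration.

```python
from torch import nn

vocab_size = 25001      # 25000 words plus <UNK>, as in the notebook
embedding_dim = 50
hidden_size = 64        # assumption: the notebook does not prescribe this value

def count_parameters(module):
    """Total number of scalar parameters in a module."""
    return sum(p.numel() for p in module.parameters())

toy_embedding = nn.Embedding(vocab_size, embedding_dim)
toy_lstm = nn.LSTM(embedding_dim, hidden_size)

n_embedding = count_parameters(toy_embedding)   # vocab_size * embedding_dim
n_lstm = count_parameters(toy_lstm)             # 4 * hidden_size * (embedding_dim + hidden_size + 2)
print(n_embedding, n_lstm)
```

With these sizes the embedding layer holds far more parameters than a single LSTM layer, which is one reason a pretrained embedding helps so much.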

Perhaps the simplest form of transfer learning is to use a pretrained embedding layer.

In [15]:
with open('glove.6B.50d.pkl', 'rb') as f:
    glove = pickle.load(f)

In [16]:
glove['movie']

array([ 0.30824 ,  0.17223 , -0.23339 ,  0.023105,  0.28522 ,  0.23076 ,
       -0.41048 , -1.0035  , -0.2072  ,  1.4327  , -0.80684 ,  0.68954 ,
       -0.43648 ,  1.1069  ,  1.6107  , -0.31966 ,  0.47744 ,  0.79395 ,
       -0.84374 ,  0.064509,  0.90251 ,  0.78609 ,  0.29699 ,  0.76057 ,
        0.433   , -1.5032  , -1.6423  ,  0.30256 ,  0.30771 , -0.87057 ,
        2.4782  , -0.025852,  0.5013  , -0.38593 , -0.15633 ,  0.45522 ,
        0.04901 , -0.42599 , -0.86402 , -1.3076  , -0.29576 ,  1.209   ,
       -0.3127  , -0.72462 , -0.80801 ,  0.082667,  0.26738 , -0.98177 ,
       -0.32147 ,  0.99823 ])

These pretrained word embeddings have useful structure: related words tend to lie close together. For instance:

In [17]:
print('Distance b/w queen and prince =', np.linalg.norm(glove['queen'] - glove['prince']))
print('Distance b/w movie and prince =', np.linalg.norm(glove['movie'] - glove['prince']))

Distance b/w queen and prince = 3.3926491063677284
Distance b/w movie and prince = 6.450784821820618


You can even sometimes get away with doing algebra with these vectors:

In [18]:
queenlike = glove['king'] - glove['man'] + glove['woman']

In [19]:
print('Distance b/w queen and algebraic queen =', np.linalg.norm(glove['queen'] - queenlike))
print('Distance b/w queen and king =', np.linalg.norm(glove['queen'] - glove['king']))

Distance b/w queen and algebraic queen = 2.8391206432941996
Distance b/w queen and king = 3.4777562289742345
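
A common way to inspect such algebra is to look up which known words lie closest to the resulting vector. The sketch below uses cosine similarity rather than the Euclidean distance used above, and runs on made-up 3-dimensional vectors (`toy_vectors`) instead of the real GloVe data; `nearest_words` is a hypothetical helper, and it assumes the vectors are stored in a dictionary from word to NumPy array.

```python
import numpy as np

def nearest_words(query, vectors, k=3, exclude=()):
    """Return the k words whose vectors have the highest cosine similarity to `query`."""
    scores = {}
    for word, vec in vectors.items():
        if word in exclude:
            continue
        scores[word] = float(np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec)))
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Made-up vectors purely for illustration; real GloVe vectors here have 50 dimensions.
toy_vectors = {
    'king':  np.array([0.9, 0.8, 0.1]),
    'queen': np.array([0.9, 0.1, 0.8]),
    'man':   np.array([0.1, 0.9, 0.1]),
    'woman': np.array([0.1, 0.1, 0.9]),
    'movie': np.array([-0.8, 0.3, 0.3]),
}

query = toy_vectors['king'] - toy_vectors['man'] + toy_vectors['woman']
closest = nearest_words(query, toy_vectors, k=1, exclude=('king', 'man', 'woman'))
print(closest)
```

If `glove` is a plain dictionary, the same helper could be tried as `nearest_words(queenlike, glove, exclude=('king', 'man', 'woman'))`, although looping over the full vocabulary in Python will be slow.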


We can fill out our embedding layer using these pretrained vectors:

In [20]:
filled = 0
not_found = []
for i, w in enumerate(vocab):
    if w in glove:
        embedding.weight.data[i, :] = torch.tensor(glove[w], dtype=torch.float)
        filled += 1
    else:
        not_found.append(w)
print(f'Of the words in the vocab, {100 * filled / (len(vocab) - 1)} % were updated using Glove vectors')
print('Examples of words not found :', not_found[1:7])

Of the words in the vocab, 73.692 % were updated using Glove vectors
Examples of words not found : ['owever', 'ichael', 'itbr', 'eorge', 'nglish', 'ritish']


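Most of the words we do not find are not ordinary misspellings: `remove_special_symbols` drops uppercase letters, so _However_ becomes _owever_ and _Michael_ becomes _ichael_, and leftover `<br />` markup produces words such as _itbr_. You now have three choices before you continue with the exercise: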

(1) Use _hunspell_ or similar to fix misspelled words

(2) Change the vocabulary (`vocab`) to be based on words that are both frequent in the text and have glove vectors

(3) Ignore the issue.
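
Option (2) could look roughly like the following sketch. `build_vocab` is a hypothetical helper, and `toy_words` and `toy_pretrained` are made-up stand-ins for `all_words` and `glove`.

```python
from collections import Counter

def build_vocab(words, pretrained, n_words):
    """Most frequent words that also have a pretrained vector, with <UNK> at index 0."""
    counts = Counter(words)
    kept = [w for w, _ in counts.most_common() if w in pretrained]
    return ['<UNK>'] + kept[:n_words]

# Toy stand-ins for `all_words` and `glove` (made-up data).
toy_words = 'the movie was good the movie was owever good the'.split()
toy_pretrained = {'the': None, 'movie': None, 'was': None, 'good': None}

toy_vocab = build_vocab(toy_words, toy_pretrained, n_words=3)
print(toy_vocab)
```

With the real data this would be called as `build_vocab(all_words, glove, n_words)`, after which `vocab_d` and the embedding layer need to be rebuilt from the new vocabulary.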

Finally, with GloVe vectors we do not need to train the embedding. We can consider it fixed:

In [21]:
embedding.requires_grad_(False)

Embedding(25001, 50)
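
An alternative route to a fixed embedding is to assemble the weight matrix first and pass it to `nn.Embedding.from_pretrained`. This is only a sketch on made-up stand-ins (`toy_vocab`, `toy_glove`); unlike the loop above, words without a pretrained vector get a zero row here instead of keeping their random initialisation.

```python
import numpy as np
import torch
from torch import nn

# Toy stand-ins (made-up values) for `vocab` and `glove`.
toy_vocab = ['<UNK>', 'the', 'movie', 'owever']
toy_glove = {'the': np.array([0.1, 0.2, 0.3]), 'movie': np.array([0.4, 0.5, 0.6])}
dim = 3

# Rows for <UNK> and for words without a pretrained vector stay zero.
weights = torch.zeros(len(toy_vocab), dim)
for i, w in enumerate(toy_vocab):
    if w in toy_glove:
        weights[i] = torch.tensor(toy_glove[w], dtype=torch.float)

# freeze=True marks the weights as not requiring gradients.
frozen_embedding = nn.Embedding.from_pretrained(weights, freeze=True)
print(frozen_embedding.weight.requires_grad)
```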

### Exercise 1

Train a neural network using `nn.LSTM` layers to classify the IMDB reviews.

Ideas:

 - The sentences can be quite long, so you might want to limit them to, say, a maximum of 100 words.
 - `nn.LSTM` can take batched input, but be careful with `batch_first=True/False`.
 - If batched input is used, the sequences normally have to be equal in length. You can make every sentence exactly 100 words long by adding `<UNK>` words to short sentences.
 - Alternatively, `nn.LSTM` can also accept batched, variable-length input using `torch.nn.utils.rnn.pack_sequence`.
 - (A final, albeit slow, alternative is to simply run the LSTM on un-batched input.)
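
The padding idea and the `batch_first` convention from the list above can be sketched as follows. This is one possible starting point under stated assumptions, not a reference solution: `pad_or_truncate` and `ReviewClassifier` are names introduced here, `hidden_size=64` is arbitrary, and a small random embedding stands in for the GloVe-filled one.

```python
import torch
from torch import nn

def pad_or_truncate(ids, length=100, pad_id=0):
    """Cut a 1-D LongTensor to `length` entries, or right-pad it with `pad_id` (<UNK>)."""
    ids = ids[:length]
    padding = torch.full((length - len(ids),), pad_id, dtype=torch.long)
    return torch.cat([ids, padding])

class ReviewClassifier(nn.Module):
    def __init__(self, embedding, hidden_size=64):
        super().__init__()
        self.embedding = embedding
        # batch_first=True means the LSTM expects (batch, seq_len, features)
        self.lstm = nn.LSTM(embedding.embedding_dim, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):                       # x: (batch, seq_len) word indices
        emb = self.embedding(x)                 # (batch, seq_len, embedding_dim)
        _, (h_n, _) = self.lstm(emb)            # h_n: (num_layers, batch, hidden_size)
        return self.head(h_n[-1]).squeeze(-1)   # (batch,) logits

toy_embedding = nn.Embedding(20, 50)            # stand-in for the GloVe-filled embedding
model = ReviewClassifier(toy_embedding)

# Two fake reviews of 7 and 150 words, both brought to 100 words.
batch = torch.stack([pad_or_truncate(torch.randint(1, 20, (n,))) for n in (7, 150)])
logits = model(batch)
labels = torch.tensor([0.0, 1.0])
loss = nn.BCEWithLogitsLoss()(logits, labels)
print(batch.shape, logits.shape, loss.item())
```

Note that with right-padding the final hidden state of a short review is computed after many `<UNK>` tokens; padding on the left or using packed sequences are possible ways around this.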


## Language model

Often in NLP, a lot of unlabelled text is available. We can use this to pretrain the model before fine-tuning it on the task at hand.

In [37]:
with open('unlabelled.json') as f:
    text = json.load(f)
print(len(text))

50000


A simple way to pretrain is to train a _language model_. This is a model that tries to predict the next word in a sentence. For instance, given:

In [38]:
" ".join(text[2].split()[:28])

'This is still a pretty bad and silly simplistic typical slasher but when being compared to the previous sequel "Slumber Party Massacre II" this movie is a step'

can you guess the next word?

In a language model, we consider the above the input, and the output is:

In [39]:
text[2].split()[28]

'up'

The input we encode using the embedding, while the output is a probability distribution over the words in the vocabulary.
In other words, the last layer is something like `nn.Linear(..., len(vocab))`.
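
A minimal sketch of this setup, under stated assumptions, is shown below. The class name `LanguageModel`, the hidden size and the tiny vocabulary are all made up for illustration, and random indices stand in for real encoded sentences. The targets are simply the inputs shifted by one word, and the cross-entropy loss applies the softmax over the vocabulary internally.

```python
import torch
from torch import nn

class LanguageModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim=50, hidden_size=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, vocab_size)   # one score per vocabulary word

    def forward(self, x):                        # x: (batch, seq_len)
        out, _ = self.lstm(self.embedding(x))    # (batch, seq_len, hidden_size)
        return self.head(out)                    # (batch, seq_len, vocab_size) logits

toy_vocab_size = 30                              # made-up, tiny vocabulary
lm = LanguageModel(toy_vocab_size)

tokens = torch.randint(0, toy_vocab_size, (4, 12))   # fake batch of 4 sentences
inputs, targets = tokens[:, :-1], tokens[:, 1:]      # predict word t+1 from words up to t
logits = lm(inputs)
loss = nn.functional.cross_entropy(logits.reshape(-1, toy_vocab_size), targets.reshape(-1))
print(logits.shape, loss.item())
```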

### Exercise 2 (optional)

Train a language model.

After you have trained the model, try to make it complete sentences.
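
One simple way to complete a sentence is greedy decoding: repeatedly append the word the model considers most likely. The sketch below assumes a model that maps indices of shape `(batch, seq_len)` to logits of shape `(batch, seq_len, vocab)`; `complete` is a hypothetical helper, and `stand_in` is an untrained toy network used only so the example runs.

```python
import torch
from torch import nn

@torch.no_grad()
def complete(model, ids, n_new=5):
    """Greedily append the most likely next word index `n_new` times."""
    ids = ids.clone()
    for _ in range(n_new):
        logits = model(ids.unsqueeze(0))     # (1, seq_len, vocab)
        next_id = logits[0, -1].argmax()     # most likely word after the last position
        ids = torch.cat([ids, next_id.view(1)])
    return ids

toy_vocab_size = 30
# Untrained stand-in with the assumed input/output shapes (not a real language model).
stand_in = nn.Sequential(nn.Embedding(toy_vocab_size, 8), nn.Linear(8, toy_vocab_size))

prompt = torch.tensor([3, 14, 15])
completed = complete(stand_in, prompt, n_new=5)
print(completed)
```

The resulting indices can be mapped back to words with `vocab[i]`. Sampling from the softmax of the logits instead of taking the argmax usually gives less repetitive text.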

## Transfer learning
### Exercise 3 (optional)

Discard the last layer of the now trained language model and use the remaining layers as the starting point for training on the original IMDB problem.
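
The idea can be sketched as follows, assuming the language model exposes its layers as `embedding`, `lstm` and `head`. Both class names are hypothetical, and the untrained `pretrained_lm` merely stands in for a model trained in Exercise 2.

```python
import torch
from torch import nn

class LanguageModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim=50, hidden_size=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, vocab_size)

    def forward(self, x):
        out, _ = self.lstm(self.embedding(x))
        return self.head(out)

class TransferClassifier(nn.Module):
    def __init__(self, language_model):
        super().__init__()
        # Reuse the pretrained layers and drop the vocabulary-sized head.
        self.embedding = language_model.embedding
        self.lstm = language_model.lstm
        self.head = nn.Linear(self.lstm.hidden_size, 1)   # new, untrained output layer

    def forward(self, x):
        _, (h_n, _) = self.lstm(self.embedding(x))
        return self.head(h_n[-1]).squeeze(-1)             # (batch,) logits

pretrained_lm = LanguageModel(vocab_size=30)   # stands in for a trained language model
clf = TransferClassifier(pretrained_lm)
logits = clf(torch.randint(0, 30, (4, 12)))
shares_lstm = clf.lstm is pretrained_lm.lstm
print(logits.shape, shares_lstm)
```

Whether to freeze the reused layers at first, for example with `clf.lstm.requires_grad_(False)`, or to fine-tune everything from the start is a design choice worth experimenting with.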