# RNN and Natural Language Processing

In this exercise we will try to classify the IMDB dataset: given the text of a review, can you predict whether the review was positive or negative?

Before doing this exercise, you might want to become more familiar with LSTMs by working through the example FlightPassengerPredictions.

The data for this exercise can be found here:
https://sid.erda.dk/share_redirect/encok5nw3y

***

Author: Julius Kirkegaard and Troels C. Petersen<br>
Date: 14th of May 2023

In [25]:
import matplotlib.pyplot as plt
import numpy as np
import torch
import pickle
from collections import defaultdict
from torch import nn
import json
from torch.utils.data import DataLoader
from tqdm import tqdm
from itertools import chain
from collections import Counter

device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)

cpu


## Data

In [2]:
limit_data = 10000  # limit the amount of data for speed, change as you please

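def remove_special_symbols(string):
    # Keep only lowercase a-z and spaces. Uppercase letters, digits and punctuation
    # are dropped (not converted), so e.g. "Scottish" becomes "cottish".
    return ''.join(s for s in string if 'a' <= s <= 'z' or s == ' ')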

with open('train.json') as f:
    train_text, train_labels = json.load(f)
    idxs = np.random.permutation(len(train_text))
    train_text = [remove_special_symbols(train_text[i]) for i in idxs[:limit_data]]
    train_labels = [train_labels[i] for i in idxs[:limit_data]]
    
with open('test.json') as f:
    test_text, test_labels = json.load(f)
    idxs = np.random.permutation(len(test_text))   
    test_text = [remove_special_symbols(test_text[i]) for i in idxs[:limit_data]]
    test_labels = [test_labels[i] for i in idxs[:limit_data]]

Let's have a look at the data... Here is a negative review (label = 0)

In [3]:
print(train_text[0])
print(train_labels[0])

 really have to disagree with guyyardleyrees who should he have watched the entire film would have seen some absolutely stunning cottish scenery some of the best ever shot in kye and found a film with a difficult start come together into a really poignant wholebr br his is not a big budget film ather it is a film that has a strong community feelbr br  cant say how much standard films bore me  pushing out the same polished stuff again and again eachd doesnt seem to be about that at all t really seems to be trying to offer something more real and certainly more aelic than any recent cottish filmbr br  so the acting isnt in the style a blockbuster hats because the actors are seemingly real people  actually thought that the key roles of the boy and his randfather were really convincing  and at times unusually beautifulbr br eachd really bears a second viewing since there are many threads that become clearer second time around  that really do feed into the endingbr br verall the combination

And a positive one:

In [4]:
print(train_text[3])
print(train_labels[3])

 had tried to rent this on many occasions but was always with the girlfriend who as a general rule usually rejects heist flicks and ensemble comedies with the comment hm looks good but im not in the mood for that movie hus entereth the lmighty olo ovie ightbr br nyway  found elcome o ollinwood a rather enjoyable movie hile ultimately fairly forgettable it does have moments of fun and a few laugh out loud moments  was unfamiliar with the fact that it was a remake and as a general rule watch movies trying to ignore that fact and watch them on their own merits anyway eorge looney puts in a humorous and brief cameo as a wheeled safe cracker that for the most part left me wondering two things  wouldnt every comedy be better if r looney put in a strange  minute cameo and  ow do they make fake tattoos that look old and faded and how easily do they wash off he cast all fine actors in their own right put in a great job and you get the impression that they had a good time working together which 

## Embedding

Let's have a look at all the words we have:

In [5]:
all_words = list(chain(*[x.lower().split() for x in train_text]))
print('total number of unique words =', len(set(all_words)))

total number of unique words = 74948


That's a lot of words. We could of course clean the data further (for instance, there are probably many misspelled words), but we won't.

In [6]:
Counter(all_words)

Counter({'really': 4494,
         'have': 10927,
         'to': 53463,
         'disagree': 45,
         'with': 17272,
         'guyyardleyrees': 1,
         'who': 7875,
         'should': 1968,
         'he': 27833,
         'watched': 867,
         'the': 116444,
         'entire': 581,
         'film': 14841,
         'would': 4796,
         'seen': 2523,
         'some': 5743,
         'absolutely': 576,
         'stunning': 154,
         'cottish': 37,
         'scenery': 151,
         'of': 57960,
         'best': 2314,
         'ever': 2286,
         'shot': 765,
         'in': 35012,
         'kye': 4,
         'and': 62522,
         'found': 1058,
         'a': 62902,
         'difficult': 269,
         'start': 645,
         'come': 1195,
         'together': 839,
         'into': 3661,
         'poignant': 64,
         'wholebr': 11,
         'br': 23307,
         'his': 16856,
         'is': 43046,
         'not': 11269,
         'big': 1157,
         'budget': 572,
     

In [7]:
n_words = 25000   # let's make a model that only understands 25000 words

In [8]:
words, count = np.unique(all_words, return_counts=True)
idxs = np.argsort(count)[-n_words:]
vocab = ['<UNK>'] + list(words[idxs][::-1])
print(vocab[:5], '...', vocab[-5:])

['<UNK>', 'the', 'a', 'and', 'of'] ... ['yoursbr', 'beatles', 'stripclub', 'stripes', 'onhamarter']


Not very surprisingly, the most commonly used word is _the_. The 25000th most used word in the run shown above is _onhamarter_ (this depends on the random subset of reviews that was drawn). We have added a special word `<UNK>`, which we will use to mark words outside our vocabulary.

We can now turn a sentence into a sequence of integers that correspond to the position in the vocab.

In [9]:
vocab_d = {vocab[i]: i for i in range(len(vocab))}  # for quick look-up
def sentence_to_integer_sequence(s):
    return torch.tensor([vocab_d[x] if x in vocab_d else 0 for x in s.split()], dtype=torch.long)

In [10]:
sentence_to_integer_sequence("i really liked the movie xenopus51")

tensor([143,  56, 421,   1,  18,   0])
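To make the mapping concrete, here is a minimal round-trip sketch. It uses a made-up five-word vocabulary (`toy_vocab`) and hypothetical helper names (`encode`, `decode`) rather than the notebook's `vocab`, and plain lists instead of tensors, so it only illustrates the idea: every word outside the vocabulary collapses onto index 0 and cannot be recovered.

```python
# Made-up vocabulary for illustration; index 0 is reserved for unknown words.
toy_vocab = ['<UNK>', 'the', 'movie', 'i', 'liked']
toy_lookup = {w: i for i, w in enumerate(toy_vocab)}

def encode(sentence):
    # Words that are not in the vocabulary are mapped to index 0 (<UNK>).
    return [toy_lookup.get(w, 0) for w in sentence.split()]

def decode(indices):
    return ' '.join(toy_vocab[i] for i in indices)

ids = encode('i really liked the movie')
round_trip = decode(ids)
print(ids)
print(round_trip)
```

Decoding shows which words the model will never "see" as anything other than `<UNK>`.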

#### Representing high dimensional spaces with a simpler (learnable) embedding:

We are now representing words in a "25000"-dimensional space: we have a unique integer for each word we can represent. To reduce this complexity, we instead intend to represent each word as a 50-dimensional real vector. PyTorch to the rescue:

In [11]:
embedding = nn.Embedding(len(vocab), 50)

`nn.Embedding` assigns a random, **but trainable**, vector to each word. For instance:

In [12]:
print(embedding(sentence_to_integer_sequence("movie")))

tensor([[ 0.4881, -0.0029, -0.1876, -0.5179,  0.4515, -1.1402,  1.5735, -0.0110,
         -1.3785,  0.6314,  2.2267,  0.4240, -1.0660, -0.3033,  0.0306,  0.3344,
          1.4658, -0.0479, -0.6608,  0.1049,  0.7835, -0.1623,  1.1444,  0.9846,
          0.4603,  1.0205, -2.0119, -1.2422,  1.6663,  0.5825, -1.9271, -1.1698,
         -0.7586,  0.1706, -1.2880,  1.2837, -1.5307, -2.1710,  0.7318, -2.7492,
         -1.1494,  0.1715,  0.5571, -0.6990,  0.6971, -0.7531,  0.6472, -0.4681,
          0.5629,  0.3255]], grad_fn=<EmbeddingBackward>)


The special `<UNK>` word, which signifies an unknown word, we can choose to zero out:

In [13]:
embedding.weight.data[0, :] = 0
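Zeroing the row by hand only sets its starting value; if the embedding is trained later, that row can receive gradient updates like any other. One possible alternative, sketched below with toy sizes (10 words, 4 dimensions, both chosen only for illustration), is the `padding_idx` argument of `nn.Embedding`, which initialises the chosen row to zero and excludes it from gradient updates.

```python
import torch
from torch import nn

# Toy sizes for illustration only; the notebook uses len(vocab) words and 50 dimensions.
toy_embedding = nn.Embedding(10, 4, padding_idx=0)

# Take one optimiser step on a batch that contains index 0 twice.
optimizer = torch.optim.SGD(toy_embedding.parameters(), lr=0.1)
loss = toy_embedding(torch.tensor([0, 3, 0, 7])).sum()
loss.backward()
optimizer.step()

# Row 0 received no gradient and is still all zeros; row 3 was updated.
row0_is_zero = bool(torch.all(toy_embedding.weight[0] == 0))
print(row0_is_zero)
```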

Here is an example of how we represent a sentence then:

In [14]:
print(embedding(sentence_to_integer_sequence("i really liked the movie xenopus51")).shape)
print(embedding(sentence_to_integer_sequence("i really liked the movie xenopus51")))

torch.Size([6, 50])
tensor([[ 5.5641e-01, -1.6343e+00,  1.0628e+00, -1.9976e-02,  3.3113e-01,
          1.8641e+00, -6.5232e-01, -8.9025e-01,  3.0268e-01,  5.7124e-01,
          1.0920e+00,  3.2339e-01, -4.3906e-01, -6.8665e-01,  3.9110e-02,
          9.7126e-01,  2.5054e+00, -4.2875e-01,  8.8234e-02,  1.0390e+00,
         -6.1140e-01, -1.5174e-01, -1.2417e+00,  1.9044e+00, -1.0692e+00,
          9.5485e-01,  1.4346e+00, -2.7633e-01, -1.5929e+00, -1.8759e+00,
          4.8675e-02,  1.1139e+00,  1.1466e+00, -4.5798e-02,  1.0682e+00,
          7.2928e-01, -2.0445e-01, -7.0702e-02,  3.6502e-01, -1.5810e+00,
          7.2825e-01, -1.5468e-01,  6.4691e-01,  7.4563e-01,  6.1608e-02,
         -2.4876e-01, -1.2650e+00, -3.1478e-01, -2.3173e+00, -8.7808e-01],
        [ 9.6347e-01,  1.7825e-01,  3.2580e-01, -1.7448e+00,  4.0648e-01,
          3.3872e-01, -3.3913e-02, -1.1948e+00,  5.7852e-01, -5.8138e-01,
          1.7930e+00, -6.4690e-01, -1.1121e+00, -2.0914e-01,  7.0068e-01,
          2.1571e

In this way, the sentence basically becomes a 6x50 pixel image.

You can now use `nn.LSTM` after an embedding to define a neural network for sentences.
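As a rough sketch of how the two layers fit together, the snippet below pushes one six-word "sentence" through an embedding and an LSTM and counts the parameters of each. The sizes (1000 words, hidden size 32) and the names `toy_embedding` and `toy_lstm` are assumptions made for illustration, not choices taken from the exercise.

```python
import torch
from torch import nn

# Toy sizes, chosen only for illustration (the notebook uses len(vocab) = 25001 and 50).
toy_vocab_size, embed_dim, hidden_dim = 1000, 50, 32

toy_embedding = nn.Embedding(toy_vocab_size, embed_dim)
toy_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

tokens = torch.randint(0, toy_vocab_size, (1, 6))  # a batch with one 6-word sentence
vectors = toy_embedding(tokens)                    # shape (1, 6, 50)
outputs, (h_n, c_n) = toy_lstm(vectors)            # outputs: (1, 6, 32), h_n: (1, 1, 32)

# Count the trainable numbers in each layer.
n_embedding = sum(p.numel() for p in toy_embedding.parameters())
n_lstm = sum(p.numel() for p in toy_lstm.parameters())
print(outputs.shape, h_n.shape, n_embedding, n_lstm)
```

Even in this toy setting the embedding holds far more parameters than the LSTM. With the notebook's sizes, the embedding alone has 25001 x 50 = 1,250,050 entries.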

This network will have a _lot_ of parameters. For each word, 50 parameters need to be trained, and then comes the LSTM on top of that.
This is the reason that _transfer learning_ is so important in natural language processing (NLP).

Perhaps the simplest form of transfer learning is to use a pretrained embedding layer.

In [15]:
with open('glove.6B.50d.pkl', 'rb') as f:
    glove = pickle.load(f)

In [16]:
glove['movie']

array([ 0.30824 ,  0.17223 , -0.23339 ,  0.023105,  0.28522 ,  0.23076 ,
       -0.41048 , -1.0035  , -0.2072  ,  1.4327  , -0.80684 ,  0.68954 ,
       -0.43648 ,  1.1069  ,  1.6107  , -0.31966 ,  0.47744 ,  0.79395 ,
       -0.84374 ,  0.064509,  0.90251 ,  0.78609 ,  0.29699 ,  0.76057 ,
        0.433   , -1.5032  , -1.6423  ,  0.30256 ,  0.30771 , -0.87057 ,
        2.4782  , -0.025852,  0.5013  , -0.38593 , -0.15633 ,  0.45522 ,
        0.04901 , -0.42599 , -0.86402 , -1.3076  , -0.29576 ,  1.209   ,
       -0.3127  , -0.72462 , -0.80801 ,  0.082667,  0.26738 , -0.98177 ,
       -0.32147 ,  0.99823 ])

These pretrained word embeddings will have a good structure to them. For instance:

In [17]:
print('Distance b/w queen and prince =', np.linalg.norm(glove['queen'] - glove['prince']))
print('Distance b/w movie and prince =', np.linalg.norm(glove['movie'] - glove['prince']))

Distance b/w queen and prince = 3.3926491063677284
Distance b/w movie and prince = 6.450784821820618


You can even sometimes get away with doing algebra with these vectors:

In [18]:
queenlike = glove['king'] - glove['man'] + glove['woman']

In [19]:
print('Distance b/w queen and algebraic queen =', np.linalg.norm(glove['queen'] - queenlike))
print('Distance b/w queen and king =', np.linalg.norm(glove['queen'] - glove['king']))

Distance b/w queen and algebraic queen = 2.8391206432941996
Distance b/w queen and king = 3.4777562289742345
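Euclidean distance is only one way to compare word vectors; cosine similarity is another common choice. The sketch below searches for the nearest word to an "algebraic" vector. To stay self-contained it uses made-up 3-dimensional vectors (`toy_vectors`) in place of the real `glove` dictionary, and the helper names are hypothetical, so the numbers mean nothing beyond illustrating the mechanics.

```python
import numpy as np

# Made-up 3-d vectors standing in for the real GloVe dictionary (illustrative only).
toy_vectors = {
    'king':  np.array([0.9, 0.8, 0.1]),
    'queen': np.array([0.9, 0.1, 0.8]),
    'man':   np.array([0.1, 0.9, 0.1]),
    'woman': np.array([0.1, 0.1, 0.9]),
    'movie': np.array([-0.7, 0.2, 0.2]),
}

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest_word(query, vectors, exclude=()):
    """Return the word whose vector has the highest cosine similarity to `query`."""
    candidates = [w for w in vectors if w not in exclude]
    return max(candidates, key=lambda w: cosine_similarity(query, vectors[w]))

query = toy_vectors['king'] - toy_vectors['man'] + toy_vectors['woman']
best = nearest_word(query, toy_vectors, exclude=('king', 'man', 'woman'))
print(best)
```

The words used to build the query are excluded from the search, since they tend to be close to the result by construction.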


We can fill out our embedding layer using these pretrained vectors:

In [20]:
filled = 0
not_found = []
for i, w in enumerate(vocab):
    if w in glove:
        embedding.weight.data[i, :] = torch.tensor(glove[w], dtype=torch.float)
        filled += 1
    else:
        not_found.append(w)
print(f'Of the words in the vocab, {100 * filled / (len(vocab) - 1)} % were updated using Glove vectors')
print('Examples of words not found :', not_found[1:7])

Of the words in the vocab, 73.844 % were updated using Glove vectors
Examples of words not found : ['owever', 'ichael', 'itbr', 'aybe', 'nglish', 'nfortunately']


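Many of the words we do not find look like misspellings, but most are artifacts of our own cleaning: `remove_special_symbols` drops capital letters (_However_ becomes _owever_), and leftover HTML line breaks get glued onto words (_it_ followed by a `br` tag becomes _itbr_). You now have three choices before you continue with the exercise: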

(1) Use _hunspell_ or similar to fix misspelled words

(2) Change the vocabulary (`vocab`) to be based on words that are both frequent in the text and have glove vectors

(3) Ignore the issue.
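Option (2) could look roughly like the sketch below. The function name `build_vocab` and the toy inputs are assumptions for illustration; in the notebook you would pass `all_words`, `glove` and `n_words` instead.

```python
from collections import Counter

def build_vocab(words, known_vectors, n_words):
    """Most frequent words that also have a pretrained vector, with '<UNK>' at index 0."""
    counts = Counter(w for w in words if w in known_vectors)
    return ['<UNK>'] + [w for w, _ in counts.most_common(n_words)]

# Toy stand-ins: a short word list and a dictionary whose keys play the role of GloVe words.
toy_words = 'the movie was great the plot was owever great the end'.split()
toy_glove = {'the': None, 'movie': None, 'was': None, 'great': None, 'plot': None, 'end': None}

toy_vocab = build_vocab(toy_words, toy_glove, n_words=3)
print(toy_vocab)
```

With this choice every row of the embedding except `<UNK>` gets a pretrained vector, at the cost of mapping more words in the reviews to `<UNK>`.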

Finally, with GloVe vectors we do not need to train the embedding. We can treat it as fixed:

In [21]:
embedding.requires_grad_(False)

Embedding(25001, 50)
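An equivalent route, sketched here under the assumption that the pretrained vectors are already collected in one weight matrix, is `nn.Embedding.from_pretrained` with `freeze=True`. The 3-dimensional `toy_weights` below are made up and only stand in for the GloVe-filled matrix.

```python
import torch
from torch import nn

# Made-up 3-d vectors standing in for the GloVe-filled weight matrix.
toy_weights = torch.tensor([
    [0.0, 0.0, 0.0],   # row 0 plays the role of <UNK>
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
])

# freeze=True marks the weights as not requiring gradients.
frozen = nn.Embedding.from_pretrained(toy_weights, freeze=True)

looked_up = frozen(torch.tensor([2]))[0]
print(frozen.weight.requires_grad, looked_up)
```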

### Exercise 1

Train a neural network using `nn.LSTM` layers to classify the IMDB reviews.

Ideas:

 - The sentences can be quite long, so you might want to limit them to, say, a maximum of 100 words.
 - `nn.LSTM` can take batched input, but be careful with `batch_first=True/False`.
 - If batched inputs are used, they normally have to be equal in size. You can force sentences to always be 100 words long by padding short sentences with `<UNK>` words.
 - Alternatively, `nn.LSTM` can also accept batched, variable-length input using `torch.nn.utils.rnn.pack_sequence`.
 - (A final, albeit slow, alternative is to simply run the LSTM on un-batched input.)
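To illustrate the padding idea from the list above, here is one possible starting point, not a reference solution. The helper `pad_or_truncate`, the class `ReviewClassifier`, the hidden size and the choice of reading the last hidden state are all assumptions; the toy batch uses a length of 4 instead of 100 to keep the output readable.

```python
import torch
from torch import nn

def pad_or_truncate(seq, length=100, pad_value=0):
    """Cut a 1-d LongTensor to `length` entries, or right-pad it with `pad_value`."""
    seq = seq[:length]
    padding = torch.full((length - len(seq),), pad_value, dtype=torch.long)
    return torch.cat([seq, padding])

class ReviewClassifier(nn.Module):
    # Hypothetical architecture: embedding -> LSTM -> linear layer on the last hidden state.
    def __init__(self, vocab_size, embed_dim=50, hidden_dim=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, tokens):                 # tokens: (batch, length)
        x = self.embedding(tokens)             # (batch, length, embed_dim)
        _, (h_n, _) = self.lstm(x)             # h_n: (1, batch, hidden_dim)
        return self.head(h_n[-1]).squeeze(1)   # one logit per review

# Toy batch of two "reviews" of different lengths, forced to length 4.
batch = torch.stack([
    pad_or_truncate(torch.tensor([5, 2, 9]), length=4),
    pad_or_truncate(torch.tensor([1, 2, 3, 4, 5, 6]), length=4),
])
model = ReviewClassifier(vocab_size=10)
logits = model(batch)
loss = nn.BCEWithLogitsLoss()(logits, torch.tensor([1.0, 0.0]))
print(batch, logits.shape, loss.item())
```

With right-padding, the last hidden state of a short review has also processed the padding tokens. Padding on the left, or packing the sequences, are two ways to avoid that.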


## Language model

Often in NLP, a lot of unlabelled text is available. We can use this to pretrain the model, before fine-tuning to the task at hand.

In [22]:
with open('unlabelled.json') as f:
    text = json.load(f)
print(len(text))

50000


A simple way to pretrain is to train a _language model_. This is a model that tries to predict the next word in a sentence. For instance, given:

In [23]:
" ".join(text[2].split()[:28])

'This is still a pretty bad and silly simplistic typical slasher but when being compared to the previous sequel "Slumber Party Massacre II" this movie is a step'

can you guess the next word?

In a language model, we consider the above the input, and the output is:

In [24]:
text[2].split()[28]

'up'

The input we encode using the embedding, while the output is a probability map over words.
In other words, the last layer is something like `nn.Linear(..., len(vocab))`.
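A minimal sketch of such a model is shown below. The class name `LanguageModel`, the layer sizes and the toy vocabulary of 20 words are assumptions for illustration. The targets are simply the inputs shifted by one position, so every position in a sentence provides a training example.

```python
import torch
from torch import nn

class LanguageModel(nn.Module):
    # Hypothetical architecture: embedding -> LSTM -> linear layer over the vocabulary.
    def __init__(self, vocab_size, embed_dim=50, hidden_dim=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):           # tokens: (batch, length)
        x = self.embedding(tokens)
        h, _ = self.lstm(x)              # h: (batch, length, hidden_dim)
        return self.out(h)               # logits: (batch, length, vocab_size)

toy_vocab_size = 20
lm = LanguageModel(toy_vocab_size)

tokens = torch.randint(0, toy_vocab_size, (3, 8))   # 3 random "sentences" of 8 words
inputs, targets = tokens[:, :-1], tokens[:, 1:]     # predict word t+1 from words up to t

logits = lm(inputs)                                 # (3, 7, 20)
loss = nn.CrossEntropyLoss()(logits.reshape(-1, toy_vocab_size), targets.reshape(-1))

# Softmax turns the logits at one position into a probability map over words.
probs = torch.softmax(logits[0, -1], dim=0)
print(logits.shape, loss.item(), probs.sum().item())
```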

### Exercise 2 (optional)

Train a language model.

After you have trained the model, try to make it complete sentences.
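Completing a sentence amounts to repeatedly sampling a next word and feeding it back in. The sketch below shows one way to do this. `TinyLanguageModel`, `complete`, the `temperature` parameter and the six-word `toy_vocab` are all illustrative assumptions, and since the model here is untrained, the generated text is random.

```python
import torch
from torch import nn

torch.manual_seed(0)

toy_vocab = ['<UNK>', 'the', 'movie', 'was', 'great', 'bad']

class TinyLanguageModel(nn.Module):
    # Untrained stand-in for the language model from Exercise 2.
    def __init__(self, vocab_size, embed_dim=8, hidden_dim=16):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):
        h, _ = self.lstm(self.embedding(tokens))
        return self.out(h)

def complete(model, start_ids, n_new, temperature=1.0):
    """Append `n_new` sampled word indices to `start_ids`."""
    ids = list(start_ids)
    model.eval()
    with torch.no_grad():
        for _ in range(n_new):
            logits = model(torch.tensor([ids]))[0, -1]          # logits for the next word
            probs = torch.softmax(logits / temperature, dim=0)  # lower temperature = greedier
            ids.append(torch.multinomial(probs, 1).item())
    return ids

model = TinyLanguageModel(len(toy_vocab))
completed = complete(model, start_ids=[1, 2], n_new=5)
print(' '.join(toy_vocab[i] for i in completed))
```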

## Transfer learning
### Exercise 3 (optional)

Discard the last layer of the now-trained language model and use the rest of the model to train on the original IMDB problem.
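One way to set this up is sketched below, assuming the language model is built from an embedding, an LSTM and a final linear layer. The class names, the attribute names (`embedding`, `lstm`, `out`) and the decision to freeze the reused layers are all assumptions; an untrained model stands in for the trained one.

```python
import torch
from torch import nn

class LanguageModel(nn.Module):
    # Stand-in for the language model trained in Exercise 2.
    def __init__(self, vocab_size, embed_dim=50, hidden_dim=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

class SentimentFromLM(nn.Module):
    """Reuse the embedding and LSTM of a language model; replace its last layer."""
    def __init__(self, lm):
        super().__init__()
        self.embedding = lm.embedding
        self.lstm = lm.lstm
        self.head = nn.Linear(lm.lstm.hidden_size, 1)  # new, randomly initialised layer

    def forward(self, tokens):
        _, (h_n, _) = self.lstm(self.embedding(tokens))
        return self.head(h_n[-1]).squeeze(1)           # one logit per review

pretrained = LanguageModel(vocab_size=20)  # in practice: the trained language model
classifier = SentimentFromLM(pretrained)

# Optionally freeze the reused layers so that only the new head is trained at first.
for module in (classifier.embedding, classifier.lstm):
    for p in module.parameters():
        p.requires_grad_(False)

trainable = [name for name, p in classifier.named_parameters() if p.requires_grad]
logits = classifier(torch.randint(0, 20, (2, 5)))
print(trainable, logits.shape)
```

Whether to keep the reused layers frozen or fine-tune them together with the new head is a design choice worth experimenting with.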