# RNN and Natural Language Processing

In this exercise we will try to classify the IMDB dataset: given the text of a review, can you predict whether the review is positive or negative?

Before doing this exercise, you might want to become more familiar with LSTMs by working through the example FlightPassengerPredictions.

The data for this exercise can be found here:
https://sid.erda.dk/share_redirect/encok5nw3y

***

Authors: Julius Kirkegaard and Troels C. Petersen<br>
Date: 14th of May 2023

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import torch
import pickle
from collections import defaultdict
from torch import nn
import json
from torch.utils.data import DataLoader
from tqdm import tqdm
from itertools import chain
from collections import Counter

device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)

cpu


## Data

In [3]:
limit_data = 10000  # limit the amount of data for speed, change as you please

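def remove_special_symbols(string):
    # Keep only lowercase a-z and spaces. Note that uppercase letters are dropped, not lower-cased.
    return ''.join(s for s in string if (ord(s) > 96 and ord(s) < 123) or s == ' ')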

with open('train.json') as f:
    train_text, train_labels = json.load(f)
    idxs = np.random.permutation(len(train_text))
    train_text = [remove_special_symbols(train_text[i]) for i in idxs[:limit_data]]
    train_labels = [train_labels[i] for i in idxs[:limit_data]]
    
with open('test.json') as f:
    test_text, test_labels = json.load(f)
    idxs = np.random.permutation(len(test_text))   
    test_text = [remove_special_symbols(test_text[i]) for i in idxs[:limit_data]]
    test_labels = [test_labels[i] for i in idxs[:limit_data]]

Let's have a look at the data. Here is a negative review (label = 0):

In [4]:
print(train_text[0])
print(train_labels[0])

hilip  ickian movie nd a decent one for that matter etter than the aycheck oo and that abomination called inority eport pielberg ut lets face it the twisting and cheesing ending was a bit too much for me alf way through the movie  already started to fear about such kind of ending and  was regrettably right ut that does not mean that the film is not worth its time o not at all irst half as already many here have commented is awesome here are some parts where you start to doubt whether the director intended to convey the message that showmanship is highly important thing in the future we will do such kind on corny sf things because we  or is it simply over combining ut the paranoia is there and feeling out of joint also ood one
0


And a positive one:

In [5]:
print(train_text[3])
print(train_labels[3])

and amp was awful he aked ile was a little better and this third straight to  in the merican ie franchise seems the same quality as the predecessor asically rik tifler ohn hite split from his girlfriend after losing his virginity and now him and ike ooze oozeman ake iegel are joining riks cousin wight teve alley at college ith the promise of many parties plenty of booze and enough hot chicks at the eta ouse they only have fifty listed tasks to carry out to become official privileged members ut a threat comes into sight with the rivals  eek ouse led by powerhungry nerd and sheep shagger dgar yrone avage offering bigger and better than what eta have o settle it once and for all eta and ek go into battle with the banned for forty years reek ames to beat each other in with the loser moving out he last champion of the games oah evenstein aka ims ad the only regular ugene evy runs the show which sees the people unhooking bras a gladiator duel floating on water catching a greased pig ussian o

## Embedding

Let's have a look at all the words we have:

In [6]:
all_words = list(chain(*[x.lower().split() for x in train_text]))
print('total number of unique words =', len(set(all_words)))

total number of unique words = 74445


That's a lot of words. We could of course clean the data further (for instance, there are probably many misspelled words), but we won't.

In [7]:
Counter(all_words)

Counter({'the': 114541,
         'a': 62345,
         'and': 61505,
         'of': 57213,
         'to': 53223,
         'is': 43213,
         'in': 35007,
         'he': 27608,
         'that': 26795,
         'it': 26019,
         'this': 24149,
         'br': 23360,
         'was': 18898,
         'as': 17414,
         'with': 17101,
         'his': 17100,
         'for': 16926,
         'movie': 16521,
         'film': 14704,
         'but': 13720,
         'on': 13381,
         'are': 11623,
         'have': 11019,
         'not': 10945,
         'you': 10600,
         'be': 10515,
         'one': 9520,
         'an': 9091,
         'at': 8927,
         'by': 8753,
         'all': 8449,
         'who': 8073,
         'from': 7801,
         'or': 7762,
         'like': 7607,
         'its': 7407,
         'they': 7114,
         'so': 6866,
         'about': 6819,
         'her': 6795,
         'just': 6609,
         'has': 6570,
         'out': 6472,
         'some': 5698,
        

In [8]:
n_words = 25000   # let's make a model that only understands 25000 words

In [9]:
words, count = np.unique(all_words, return_counts=True)
idxs = np.argsort(count)[-n_words:]
vocab = ['<UNK>'] + list(words[idxs][::-1])
print(vocab[:5], '...', vocab[-5:])

['<UNK>', 'the', 'a', 'and', 'of'] ... ['bumblers', 'meki', 'lowp', 'stonecold', 'vying']


Not very surprisingly, the most commonly used word is _the_. In the run shown above, the 25000th most used word is _vying_ (the training subset is drawn at random, so this can differ between runs). We have added a special word `<UNK>`, which we will use to mark words outside our vocabulary.

We can now turn a sentence into a sequence of integers, each giving the position of a word in `vocab`.

In [10]:
vocab_d = {vocab[i]: i for i in range(len(vocab))}  # for quick look-up
def sentence_to_integer_sequence(s):
    return torch.tensor([vocab_d[x] if x in vocab_d else 0 for x in s.split()], dtype=torch.long)

In [11]:
sentence_to_integer_sequence("i really liked the movie xenopus51")

tensor([138,  56, 436,   1,  18,   0])

#### Representing high dimensional spaces with a simpler (learnable) embedding:

We are now effectively representing words in a 25001-dimensional space (25000 words plus `<UNK>`): each word we can represent has its own unique integer. To reduce this complexity, we instead represent each word as a 50-dimensional real vector. PyTorch to the rescue:

In [12]:
embedding = nn.Embedding(len(vocab), 50)

`nn.Embedding` assigns a random, **but trainable**, vector to each word. For instance:

In [13]:
print(embedding(sentence_to_integer_sequence("movie")))

tensor([[ 0.3830,  1.0040,  0.8199, -0.1486, -0.4464,  0.8554,  0.9904, -0.2797,
         -1.4488,  0.3964,  0.8499, -0.1861, -0.5887, -0.0169, -0.4246,  0.4436,
          0.3712,  1.0042,  0.2687, -0.5322, -0.8029,  1.0065, -0.6357, -1.4029,
          1.3472,  1.2819,  0.2463, -0.1311, -1.1287, -0.4313,  2.2918,  1.2813,
          0.7973,  1.4662, -0.6910,  1.6100, -2.2428,  1.0082,  0.7776, -1.0881,
          0.7889, -1.8485,  0.0176, -1.8129, -0.7277,  0.0706, -1.1938, -0.3545,
         -0.0350,  0.2063]], grad_fn=<EmbeddingBackward0>)


We can choose to zero out the vector of the special `<UNK>` word, which signifies an unknown word:

In [14]:
embedding.weight.data[0, :] = 0

Here is an example of how we then represent a sentence:

In [15]:
print(embedding(sentence_to_integer_sequence("i really liked the movie xenopus51")).shape)
print(embedding(sentence_to_integer_sequence("i really liked the movie xenopus51")))

torch.Size([6, 50])
tensor([[ 1.8757e-01,  1.1478e+00, -4.0609e-01,  7.9164e-01, -1.3440e+00,
          1.3291e+00,  2.7436e-01,  1.4376e+00,  6.3498e-01, -5.0623e-01,
         -8.0561e-01, -7.7071e-01,  2.1853e+00, -1.0426e+00,  3.4722e-01,
          5.4782e-02,  7.2509e-01,  1.1170e+00,  8.1339e-01, -8.6629e-01,
         -3.0913e-01,  1.5571e+00, -1.4386e-01, -1.0269e+00,  6.0549e-01,
         -7.4567e-01, -1.0668e+00, -8.5426e-01, -6.5357e-01,  1.5315e+00,
          6.4485e-01, -3.7481e-01, -4.5751e-01, -2.4071e-01, -2.0029e+00,
          2.5380e+00, -3.6887e-01, -1.7410e+00, -1.0699e+00, -5.6962e-01,
         -2.9811e-01,  3.5479e-01, -1.4555e+00, -8.3069e-01,  1.0354e+00,
         -1.2415e-01, -6.7981e-01, -2.9255e-01, -5.1272e-01, -1.0905e+00],
        [-3.1909e-01, -3.5195e-01, -7.2693e-02,  1.3847e-01,  2.6423e-03,
          1.3508e+00, -2.1623e+00,  1.5713e+00, -6.1308e-01,  2.5192e-01,
         -8.1770e-01,  1.4639e+00,  2.3467e-01, -9.4482e-01, -4.9316e-01,
          5.9213e

In this way, the sentence basically becomes a 6x50 pixel image.

You can now use `nn.LSTM` after an embedding to define a neural network for sentences.
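As a rough illustration of how the two layers fit together, here is a minimal sketch that feeds one integer-encoded sentence through an embedding and an LSTM and prints the resulting shapes. The hidden size of 64 and the names `toy_embedding` and `toy_lstm` are arbitrary choices for this sketch, not part of the exercise.

```python
import torch
from torch import nn

torch.manual_seed(0)

vocab_size, embed_dim, hidden_size = 25001, 50, 64  # hidden_size is an arbitrary choice

toy_embedding = nn.Embedding(vocab_size, embed_dim)
toy_lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)

# One "sentence" of 6 word indices, with a leading batch dimension of 1.
tokens = torch.tensor([[138, 56, 436, 1, 18, 0]])

embedded = toy_embedding(tokens)          # shape (1, 6, 50)
outputs, (h_n, c_n) = toy_lstm(embedded)  # outputs: (1, 6, 64), h_n: (1, 1, 64)

print(embedded.shape, outputs.shape, h_n.shape)
```

`outputs` holds one hidden vector per word, while `h_n` holds only the final hidden state, which is a natural summary of the whole sentence to pass on to a classifier.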

This network will have a _lot_ of parameters. For each word, 50 parameters need to be trained, and the LSTM comes on top of that.
This is the reason that _transfer learning_ is so important in natural language processing (NLP).
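To get a feeling for the numbers, the sketch below counts the trainable parameters of an embedding of the size used here and of a small LSTM. The helper `count_parameters` and the LSTM hidden size of 64 are assumptions made for illustration only.

```python
from torch import nn

def count_parameters(module):
    """Total number of trainable scalars in a module."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

emb = nn.Embedding(25001, 50)   # one 50-d vector per vocabulary entry
lstm = nn.LSTM(50, 64)          # hidden size 64 is an arbitrary choice

n_emb = count_parameters(emb)    # 25001 * 50
n_lstm = count_parameters(lstm)  # 4 * (50*64 + 64*64 + 64 + 64)
print(n_emb, n_lstm)
```

With these sizes the embedding alone holds well over a million parameters, far more than the LSTM, which is why reusing a pretrained embedding is attractive.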

Perhaps the simplest form of transfer learning is to use a pretrained embedding layer.

In [16]:
with open('glove.6B.50d.pkl', 'rb') as f:
    glove = pickle.load(f)

In [17]:
glove['movie']

array([ 0.30824 ,  0.17223 , -0.23339 ,  0.023105,  0.28522 ,  0.23076 ,
       -0.41048 , -1.0035  , -0.2072  ,  1.4327  , -0.80684 ,  0.68954 ,
       -0.43648 ,  1.1069  ,  1.6107  , -0.31966 ,  0.47744 ,  0.79395 ,
       -0.84374 ,  0.064509,  0.90251 ,  0.78609 ,  0.29699 ,  0.76057 ,
        0.433   , -1.5032  , -1.6423  ,  0.30256 ,  0.30771 , -0.87057 ,
        2.4782  , -0.025852,  0.5013  , -0.38593 , -0.15633 ,  0.45522 ,
        0.04901 , -0.42599 , -0.86402 , -1.3076  , -0.29576 ,  1.209   ,
       -0.3127  , -0.72462 , -0.80801 ,  0.082667,  0.26738 , -0.98177 ,
       -0.32147 ,  0.99823 ])

These pretrained word embeddings have a meaningful structure: related words tend to lie closer together than unrelated ones. For instance:

In [20]:
print('Distance b/w queen and prince =', np.linalg.norm(glove['queen'] - glove['prince']))
print('Distance b/w movie and prince =', np.linalg.norm(glove['movie'] - glove['prince']))

Distance b/w queen and prince = 3.3926491063677284
Distance b/w movie and prince = 6.450784821820618


You can even sometimes get away with doing algebra with these vectors:

In [21]:
queenlike = glove['king'] - glove['man'] + glove['woman']

In [22]:
print('Distance b/w queen and algebraic queen =', np.linalg.norm(glove['queen'] - queenlike))
print('Distance b/w queen and king =', np.linalg.norm(glove['queen'] - glove['king']))

Distance b/w queen and algebraic queen = 2.8391206432941996
Distance b/w queen and king = 3.4777562289742345
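The comparisons above use Euclidean distance. Cosine similarity is a common alternative for word vectors, and a small nearest-neighbour search makes the "algebra" easier to inspect. The following is only a sketch: `cosine_similarity`, `nearest_words` and the tiny 2-dimensional `toy` dictionary are made up for illustration; with the real data you would pass `glove` instead of `toy`.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest_words(query, vectors, k=3, exclude=()):
    """Return the k (word, similarity) pairs in `vectors` most similar to `query`."""
    scored = [(w, cosine_similarity(query, v))
              for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:k]

# Made-up 2-d vectors, chosen only so that the example is easy to follow.
toy = {
    'king':  np.array([1.0, 1.0]),
    'queen': np.array([1.0, -1.0]),
    'man':   np.array([0.0, 1.0]),
    'woman': np.array([0.0, -1.0]),
    'movie': np.array([-1.0, 0.1]),
}

query = toy['king'] - toy['man'] + toy['woman']
best = nearest_words(query, toy, k=1, exclude=('king', 'man', 'woman'))
print(best)
```

Excluding the words that went into the query is a common convention, since the query vector usually stays close to its own ingredients.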


We can fill out our embedding layer using these pretrained vectors:

In [23]:
filled = 0
not_found = []
for i, w in enumerate(vocab):
    if w in glove:
        embedding.weight.data[i, :] = torch.tensor(glove[w], dtype=torch.float)
        filled += 1
    else:
        not_found.append(w)
print(f'Of the words in the vocab, {100 * filled / (len(vocab) - 1)} % were updated using Glove vectors')
print('Examples of words not found :', not_found[1:7])

Of the words in the vocab, 74.0 % were updated using Glove vectors
Examples of words not found : ['owever', 'ichael', 'itbr', 'nglish', 'eorge', 'aybe']


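Clearly, the words we do not find are malformed. Many have lost their capital first letter because `remove_special_symbols` discards uppercase characters (e.g. _owever_, _ichael_), and others may be genuine misspellings. You now have three choices before you continue with the exercise: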

(1) Use _hunspell_ or similar to fix misspelled words

(2) Change the vocabulary (`vocab`) to be based on words that are both frequent in the text and have GloVe vectors

(3) Ignore the issue.
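Independently of which choice you make, part of the problem comes from the cleaning step: `remove_special_symbols` keeps only lowercase letters, so capitalised words lose their first letter. A possible variant is sketched below. The function name `clean_text` is hypothetical, and the handling of `<br />` assumes that the raw reviews contain HTML line breaks, which the frequent token _br_ in the word counts suggests.

```python
import re

def clean_text(string):
    string = string.lower()                  # keep capitalised words intact
    string = string.replace('<br />', ' ')   # assumed HTML line breaks in the raw reviews
    return re.sub(r'[^a-z ]', '', string)    # drop everything except a-z and spaces

example = clean_text("However, Michael liked it.<br />Maybe.")
print(example)
```

Using this in place of `remove_special_symbols` changes the vocabulary, so the word counts and the fraction of words found in GloVe would need to be recomputed.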

Finally, with GloVe vectors we do not need to train the embedding. We can consider it fixed:

In [24]:
embedding.requires_grad_(False)

Embedding(25001, 50)
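An alternative way to obtain a fixed embedding is `nn.Embedding.from_pretrained`, which builds the layer directly from a weight matrix and freezes it when `freeze=True`. The sketch below uses a small random matrix as a stand-in for the real GloVe weights; the names `weights` and `frozen` are illustrative.

```python
import numpy as np
import torch
from torch import nn

# Stand-in for a (vocab_size, 50) matrix of pretrained vectors; random here.
weights = torch.tensor(np.random.randn(10, 50), dtype=torch.float)
weights[0] = 0  # row 0 plays the role of <UNK>

frozen = nn.Embedding.from_pretrained(weights, freeze=True)

print(frozen.weight.requires_grad)
print(frozen(torch.tensor([0])).abs().sum().item())
```

Frozen parameters receive no gradients, so an optimizer built from the model's trainable parameters will leave the embedding untouched.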

### Exercise 1

Train a neural network using `nn.LSTM` layers to classify the IMDB reviews.

Ideas:

 - The sentences can be quite long, so you might want to limit them to, say, maximum 100 words
 - `nn.LSTM` can take batched input, but be careful with `batch_first=True/False`.
 - If batched inputs are used, the sequences normally have to be equal in length. You can make every sentence exactly 100 words long by adding `<UNK>` words to short sentences.
 - Alternatively, `nn.LSTM` can also accept batched, variable-length input using `torch.nn.utils.rnn.pack_sequence`.
 - (A final, albeit slow, alternative is to simply run the LSTM on un-batched input.)
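The padding idea from the list above can be sketched as follows. This is one possible starting point, not a reference solution: `pad_or_truncate`, `ReviewClassifier`, the hidden size of 64 and the toy vocabulary of 1000 words are all assumptions made for this example.

```python
import torch
from torch import nn

MAX_LEN = 100   # maximum number of words kept per review
PAD_IDX = 0     # reuse the <UNK> index for padding

def pad_or_truncate(seq, max_len=MAX_LEN, pad_idx=PAD_IDX):
    """Cut a 1-d LongTensor to max_len, or right-pad it with pad_idx."""
    seq = seq[:max_len]
    padding = torch.full((max_len - len(seq),), pad_idx, dtype=torch.long)
    return torch.cat([seq, padding])

class ReviewClassifier(nn.Module):
    def __init__(self, embedding, hidden_size=64):
        super().__init__()
        self.embedding = embedding
        self.lstm = nn.LSTM(embedding.embedding_dim, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, tokens):                  # tokens: (batch, MAX_LEN)
        x = self.embedding(tokens)              # (batch, MAX_LEN, embed_dim)
        _, (h_n, _) = self.lstm(x)              # h_n: (num_layers, batch, hidden_size)
        return self.head(h_n[-1]).squeeze(1)    # one logit per review

# Toy usage with a small random embedding and two fake reviews.
model = ReviewClassifier(nn.Embedding(1000, 50))
batch = torch.stack([
    pad_or_truncate(torch.randint(1, 1000, (30,))),    # short review, gets padded
    pad_or_truncate(torch.randint(1, 1000, (250,))),   # long review, gets truncated
])
logits = model(batch)
labels = torch.tensor([0.0, 1.0])
loss = nn.BCEWithLogitsLoss()(logits, labels)
print(batch.shape, logits.shape, loss.item())
```

In the exercise you would pass the GloVe-filled `embedding` and the output of `sentence_to_integer_sequence` instead of the random stand-ins. Note that with right-padding the final hidden state is computed after the padding tokens; left-padding or packed sequences are possible ways around this.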


## Language model

Often in NLP, a lot of unlabelled text is available. We can use this to pretrain the model, before fine-tuning to the task at hand.

In [25]:
with open('unlabelled.json') as f:
    text = json.load(f)
print(len(text))

50000


A simple way to pretrain is to train a _language model_. This is a model that tries to predict the next word in a sentence. For instance, given:

In [26]:
" ".join(text[2].split()[:28])

'This is still a pretty bad and silly simplistic typical slasher but when being compared to the previous sequel "Slumber Party Massacre II" this movie is a step'

can you guess the next word?

In a language model, we consider the above the input, and the output is:

In [27]:
text[2].split()[28]

'up'

The input we encode using the embedding, while the output is a probability distribution over words.
In other words, the last layer is something like `nn.Linear(..., len(vocab))`.
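A minimal sketch of this setup is shown below: the targets are simply the input words shifted by one position, and the scores from the last layer are compared with them using cross-entropy. The sizes and the names `to_vocab`, `inputs` and `targets` are assumptions for this example, and the sentence is random rather than taken from the data.

```python
import torch
from torch import nn

torch.manual_seed(0)
vocab_size, embed_dim, hidden_size = 1000, 50, 64  # toy sizes, not the notebook's

embedding = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)
to_vocab = nn.Linear(hidden_size, vocab_size)   # last layer: one score per word

# A fake integer-encoded sentence of 29 words. Inputs are all words but the
# last; targets are the same words shifted one step to the left.
sentence = torch.randint(1, vocab_size, (1, 29))
inputs, targets = sentence[:, :-1], sentence[:, 1:]

hidden, _ = lstm(embedding(inputs))   # (1, 28, hidden_size)
logits = to_vocab(hidden)             # (1, 28, vocab_size)

# CrossEntropyLoss applies the softmax internally, so it takes raw scores.
loss = nn.CrossEntropyLoss()(logits.reshape(-1, vocab_size), targets.reshape(-1))

# Probability distribution over the word following the 28 input words.
probs = torch.softmax(logits[0, -1], dim=-1)
print(loss.item(), probs.sum().item())
```

Training on every position at once, rather than only on the last word, gives 28 training signals from a single 29-word sentence.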

### Exercise 2 (optional)

In [28]:
import lightning as L

In [None]:
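class LanguageModel(nn.Module):
    def __init__(self, hidden_size=200, num_layers=2):
        super().__init__()
        # Assumption: an LSTM over the 50-dimensional word embeddings is intended here;
        # an output layer such as nn.Linear(hidden_size, len(vocab)) and a forward method still need to be added.
        self.lstm = nn.LSTM(50, hidden_size, num_layers=num_layers, batch_first=True)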

Train a language model.

After you have trained the model, try to make it complete sentences.
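One simple way to complete a sentence is greedy decoding: repeatedly ask the model for the most probable next word and append it. The sketch below uses an untrained, hypothetical `TinyLanguageModel` and a seven-word toy vocabulary so that it runs on its own; its output is therefore meaningless until a trained model is substituted.

```python
import torch
from torch import nn

class TinyLanguageModel(nn.Module):
    """Hypothetical stand-in for a trained language model."""
    def __init__(self, vocab_size, embed_dim=16, hidden_size=32):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, vocab_size)

    def forward(self, tokens):                       # tokens: (batch, seq_len)
        hidden, _ = self.lstm(self.embedding(tokens))
        return self.head(hidden)                     # (batch, seq_len, vocab_size)

@torch.no_grad()
def complete(model, vocab, prompt, n_new=5):
    """Greedily append the most probable next word n_new times."""
    lookup = {w: i for i, w in enumerate(vocab)}
    tokens = [lookup.get(w, 0) for w in prompt.split()]   # index 0 = <UNK>
    for _ in range(n_new):
        logits = model(torch.tensor([tokens]))[0, -1]     # scores for the next word
        tokens.append(int(logits.argmax()))
    return ' '.join(vocab[i] for i in tokens)

toy_vocab = ['<UNK>', 'this', 'movie', 'is', 'a', 'step', 'up']
toy_model = TinyLanguageModel(len(toy_vocab)).eval()
completed = complete(toy_model, toy_vocab, 'this movie is a step', n_new=3)
print(completed)
```

Greedy decoding tends to produce repetitive text; sampling from the predicted distribution (for example with `torch.multinomial`) is a common alternative.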

## Transfer learning
### Exercise 3 (optional)

Discard the last layer of the now-trained language model and use the rest of the model to train on the original IMDB problem.
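The idea can be sketched as follows: keep the embedding and LSTM of the language model, and replace its vocabulary-sized output layer with a single-output layer for sentiment. The class names, attribute names and sizes below are assumptions for this example; your own language model may be organised differently.

```python
import torch
from torch import nn

class LanguageModel(nn.Module):
    """Hypothetical language model: embedding -> LSTM -> scores over the vocabulary."""
    def __init__(self, vocab_size=1000, embed_dim=50, hidden_size=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, vocab_size)

class SentimentClassifier(nn.Module):
    """Reuses the pretrained embedding and LSTM, with a new one-logit head."""
    def __init__(self, language_model):
        super().__init__()
        self.embedding = language_model.embedding
        self.lstm = language_model.lstm
        self.head = nn.Linear(self.lstm.hidden_size, 1)   # new, untrained layer

    def forward(self, tokens):                     # tokens: (batch, seq_len)
        _, (h_n, _) = self.lstm(self.embedding(tokens))
        return self.head(h_n[-1]).squeeze(1)       # one logit per review

pretrained = LanguageModel()   # in practice: the model trained in Exercise 2
classifier = SentimentClassifier(pretrained)
logits = classifier(torch.randint(0, 1000, (4, 100)))
print(logits.shape)
```

Whether to freeze the reused layers or fine-tune them together with the new head is a design choice worth experimenting with.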