# RNN and Natural Language Processing

In this exercise we will try to classify the IMDB dataset: given the text of a review, can you predict whether the review is positive or negative?

Before doing this exercise, you might want to become more familiar with LSTMs by working through the example FlightPassengerPredictions.

The data for this exercise can be found here:
https://sid.erda.dk/share_redirect/encok5nw3y

***

Authors: Julius Kirkegaard and Troels C. Petersen<br>
Date: 5th of May 2024

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import torch
import pickle
from collections import defaultdict
from torch import nn
import json
from torch.utils.data import DataLoader
from tqdm import tqdm
from itertools import chain
from collections import Counter

device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)

## Data

In [None]:
limit_data = 10000  # limit the amount of data for speed, change as you please

def remove_special_symbols(string):
    # lowercase first (otherwise capital letters would be dropped), then keep only a-z and spaces
    return ''.join(s for s in string.lower() if 'a' <= s <= 'z' or s == ' ')

with open('train.json') as f:
    train_text, train_labels = json.load(f)
    idxs = np.random.permutation(len(train_text))
    train_text = [remove_special_symbols(train_text[i]) for i in idxs[:limit_data]]
    train_labels = [train_labels[i] for i in idxs[:limit_data]]
    
with open('test.json') as f:
    test_text, test_labels = json.load(f)
    idxs = np.random.permutation(len(test_text))   
    test_text = [remove_special_symbols(test_text[i]) for i in idxs[:limit_data]]
    test_labels = [test_labels[i] for i in idxs[:limit_data]]

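Let's have a look at the data. Here is one review and its label, where label = 0 marks a negative review. Because the data are shuffled at random, the review you see changes from run to run: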

In [None]:
print(train_text[0])
print(train_labels[0])

And a second one. Depending on the shuffle it may be positive or negative, so check its label:

In [None]:
print(train_text[3])
print(train_labels[3])

## Embedding

Let's have a look at all the words we have:

In [None]:
all_words = list(chain(*[x.lower().split() for x in train_text]))
print('total number of unique words =', len(set(all_words)))

That's a lot of words. Of course, we could clean the data further if we wanted to (for instance, there are probably many misspelled words), but we won't.

In [None]:
Counter(all_words)

In [None]:
n_words = 25000   # let's make a model that only understands 25000 words

In [None]:
words, count = np.unique(all_words, return_counts=True)
idxs = np.argsort(count)[-n_words:]
vocab = ['<UNK>'] + list(words[idxs][::-1])
print(vocab[:5], '...', vocab[-5:])

Not very surprisingly, the most commonly used word is _the_. The 25000th most used word is _chimneys_. We have added a special word `<UNK>`, which we will use to mark words outside our vocabulary.

We can now turn a sentence into a sequence of integers that correspond to the position in the vocab.

In [None]:
vocab_d = {vocab[i]: i for i in range(len(vocab))}  # for quick look-up
def sentence_to_integer_sequence(s):
    # words outside the vocabulary map to index 0, i.e. <UNK>
    return torch.tensor([vocab_d.get(x, 0) for x in s.split()], dtype=torch.long)

In [None]:
sentence_to_integer_sequence("i really liked the movie xenopus51")

#### Representing high dimensional spaces with a simpler (learnable) embedding:

We are now representing words in a "25000"-dimensional space: each word we can represent has its own unique integer. To reduce this complexity, we will instead represent each word as a 50-dimensional real vector. PyTorch to the rescue:

In [None]:
embedding = nn.Embedding(len(vocab), 50)

`nn.Embedding` assigns a random, **but trainable**, vector to each word. For instance:

In [None]:
print(embedding(sentence_to_integer_sequence("movie")))

We can choose to zero out the vector of the special `<UNK>` word, which signifies an unknown word:

In [None]:
embedding.weight.data[0, :] = 0

Here is an example of how we represent a sentence then:

In [None]:
print(embedding(sentence_to_integer_sequence("i really liked the movie xenopus51")).shape)
print(embedding(sentence_to_integer_sequence("i really liked the movie xenopus51")))

In this way, the sentence basically becomes a 6x50 pixel image.

You can now use `nn.LSTM` after an embedding to define a neural network for sentences.

This network will have a _lot_ of parameters. For each word, 50 parameters need to be trained, and then comes the LSTM on top of that.
This is the reason that _transfer learning_ is so important in natural language processing (NLP).
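
To make the parameter count above concrete, here is a rough tally. It assumes the 25001-entry vocabulary built earlier (25000 words plus `<UNK>`) and a hypothetical single-layer LSTM with 64 hidden units; the hidden size and the `demo_` names are illustrative choices, not part of the exercise.

```python
from torch import nn

vocab_size = 25001   # 25000 words + <UNK>, as built above
embed_dim = 50
hidden_size = 64     # hypothetical choice, for illustration only

demo_embedding = nn.Embedding(vocab_size, embed_dim)
demo_lstm = nn.LSTM(embed_dim, hidden_size)

# numel() gives the number of entries in each parameter tensor
n_embedding = sum(p.numel() for p in demo_embedding.parameters())
n_lstm = sum(p.numel() for p in demo_lstm.parameters())
print('embedding parameters:', n_embedding)
print('LSTM parameters     :', n_lstm)
```

With these assumed sizes the embedding table holds 1,250,050 parameters, far more than the 29,696 in the LSTM, which is why a pretrained embedding saves so much training effort.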

Perhaps the simplest form of transfer learning is to use a pretrained embedding layer.

In [None]:
with open('glove.6B.50d.pkl', 'rb') as f:
    glove = pickle.load(f)

In [None]:
glove['movie']

These pretrained word embeddings will have a good structure to them. For instance:

In [None]:
print('Distance b/w queen and prince =', np.linalg.norm(glove['queen'] - glove['prince']))
print('Distance b/w movie and prince =', np.linalg.norm(glove['movie'] - glove['prince']))

You can even sometimes get away with doing algebra with these vectors:

In [None]:
queenlike = glove['king'] - glove['man'] + glove['woman']

In [None]:
print('Distance b/w queen and algebraic queen =', np.linalg.norm(glove['queen'] - queenlike))
print('Distance b/w queen and king =', np.linalg.norm(glove['queen'] - glove['king']))

We can fill out our embedding layer using these pretrained vectors:

In [None]:
filled = 0
not_found = []
for i, w in enumerate(vocab):
    if w in glove:
        embedding.weight.data[i, :] = torch.tensor(glove[w], dtype=torch.float)
        filled += 1
    else:
        not_found.append(w)
print(f'Of the words in the vocab, {100 * filled / (len(vocab) - 1)} % were updated using Glove vectors')
print('Examples of words not found :', not_found[1:7])

Most of the words we do not find appear to be misspellings. You now have three choices before you continue with the exercise:

(1) Use _hunspell_ or similar to fix misspelled words

(2) Change the vocabulary (`vocab`) to be based on words that are both frequent in the text and have GloVe vectors

(3) Ignore the issue.
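
As a minimal sketch of option (2), the vocabulary can be built from the most frequent words that also have a pretrained vector. The `toy_` variables below are made-up stand-ins so the snippet runs on its own; in the notebook you would use `all_words`, `glove` and `n_words` instead.

```python
from collections import Counter

# Toy stand-ins (assumptions): a short word list with one misspelling ("grat")
toy_words = 'the movie was great the movie was grat the plot'.split()
toy_glove = {w: [0.0] * 50 for w in ['the', 'movie', 'was', 'great', 'plot']}
toy_n_words = 5

counts = Counter(toy_words)
# most_common() sorts by descending frequency; keep only words with a vector
frequent_known = [w for w, _ in counts.most_common() if w in toy_glove]
toy_vocab = ['<UNK>'] + frequent_known[:toy_n_words]
print(toy_vocab)
```

Every word in the resulting vocabulary then has a pretrained vector, and anything else (such as the misspelled "grat") falls back to `<UNK>`.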

Finally, with GloVe vectors we do not need to train the embedding. We can consider it fixed:

In [None]:
embedding.requires_grad_(False)
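
A quick way to check the effect of freezing is to count which parameter tensors still require gradients and to hand only those to the optimizer. The sketch below uses a small stand-in embedding and a hypothetical trainable layer `demo_head`; neither is part of the notebook.

```python
import torch
from torch import nn

demo_embedding = nn.Embedding(10, 50)
demo_embedding.requires_grad_(False)     # freeze, as done above
demo_head = nn.Linear(50, 1)             # hypothetical trainable layer

all_params = list(demo_embedding.parameters()) + list(demo_head.parameters())
trainable = [p for p in all_params if p.requires_grad]

# Only the trainable tensors are given to the optimizer
optimizer = torch.optim.Adam(trainable, lr=1e-3)
print(len(all_params), 'parameter tensors in total,', len(trainable), 'trainable')
```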

### Exercise 1

Train a neural network using `nn.LSTM` layers to classify the IMDB reviews.

Ideas:

 - The sentences can be quite long, so you might want to limit them to, say, maximum 100 words
 - `nn.LSTM` can take batched input, but be careful with `batch_first=True/False`.
 - If batched inputs are used, they normally have to be equal in size. You can make every sentence exactly 100 words long by adding `<UNK>` words to short sentences.
 - Alternatively, `nn.LSTM` can also accept batched, variable-length input using `torch.nn.utils.rnn.pack_sequence` (pass `enforce_sorted=False` unless the sequences are sorted by decreasing length).
 - (a final, albeit slow, alternative is to simply run the LSTM on un-batched input)
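
The two batching ideas above can be sketched as follows. This is only an illustration of the input handling, not a solution to the exercise: the helper `pad_or_truncate`, the tiny `max_len`, the hidden size and the example sequences are all assumptions made for readability.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_sequence

max_len = 5      # the text suggests about 100; kept tiny here
pad_index = 0    # index of <UNK> in the vocab above

def pad_or_truncate(seq, length=max_len):
    """Cut `seq` to `length` tokens, or right-pad it with `pad_index`."""
    seq = seq[:length]
    padding = torch.full((length - len(seq),), pad_index, dtype=torch.long)
    return torch.cat([seq, padding])

# Hypothetical integer sequences, as sentence_to_integer_sequence would return
seqs = [torch.tensor([4, 8, 15]), torch.tensor([16, 23, 42, 7, 9, 11, 2])]

demo_embedding = nn.Embedding(50, 50)
demo_lstm = nn.LSTM(input_size=50, hidden_size=32, batch_first=True)

# Option A: fixed-length batch of shape (batch, max_len)
batch = torch.stack([pad_or_truncate(s) for s in seqs])
out, (h, c) = demo_lstm(demo_embedding(batch))
print(batch.shape, out.shape, h.shape)

# Option B: variable-length batch via pack_sequence
packed = pack_sequence([demo_embedding(s) for s in seqs], enforce_sorted=False)
_, (h_packed, _) = demo_lstm(packed)
print(h_packed.shape)
```

Note that with padding (option A) the final hidden state `h` has also processed the padding tokens, whereas with packing (option B) `h_packed` is the state after the last real word of each sentence.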


## Language model

Often in NLP, a lot of unlabelled text is available. We can use this to pretrain the model, before fine-tuning to the task at hand.

In [None]:
with open('unlabelled.json') as f:
    text = json.load(f)
print(len(text))

A simple way to pretrain is to train a _language model_. This is a model that tries to predict the next word in a sentence. For instance, given:

In [None]:
" ".join(text[2].split()[:28])

can you guess the next word?

In a language model, we consider the above the input, and the output is:

In [None]:
text[2].split()[28]

We encode the input using the embedding, while the output is a probability distribution over words.
In other words, the last layer is something like `nn.Linear(..., len(vocab))`.
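
A minimal sketch of this setup is shown below, assuming a toy vocabulary and a hypothetical `TinyLanguageModel`; the layer sizes are arbitrary. Instead of predicting only the word after a fixed prefix, it predicts the next word at every position of the sentence, which is a common variant of the same idea.

```python
import torch
from torch import nn

# Toy vocabulary and sentence (assumptions); in the notebook use `vocab_d` and `text`
toy_vocab = ['<UNK>', 'the', 'movie', 'was', 'really', 'good']
toy_index = {w: i for i, w in enumerate(toy_vocab)}
tokens = [toy_index.get(w, 0) for w in 'the movie was really good'.split()]

# Inputs are all words but the last; targets are the same words shifted by one
inputs = torch.tensor(tokens[:-1]).unsqueeze(0)    # shape (1, 4)
targets = torch.tensor(tokens[1:]).unsqueeze(0)    # shape (1, 4)

class TinyLanguageModel(nn.Module):   # hypothetical architecture
    def __init__(self, vocab_size, embed_dim=50, hidden_size=32):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, vocab_size)   # one score per word

    def forward(self, x):
        out, _ = self.lstm(self.embedding(x))
        return self.head(out)          # logits, shape (batch, time, vocab)

lm = TinyLanguageModel(len(toy_vocab))
logits = lm(inputs)
# CrossEntropyLoss applies the softmax itself, so it takes raw logits
loss = nn.CrossEntropyLoss()(logits.reshape(-1, len(toy_vocab)), targets.reshape(-1))
print(logits.shape, loss.item())
```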

### Exercise 2 (optional)

Train a language model.

After you have trained the model, try to make it complete sentences.
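
One simple way to complete a sentence is greedy decoding: repeatedly feed the sentence so far to the model and append the most likely next word. The sketch below assumes a model that returns logits of shape (batch, time, vocab); `TinyLanguageModel` and `complete` are hypothetical names, and the model here is untrained, so its output is gibberish.

```python
import torch
from torch import nn

class TinyLanguageModel(nn.Module):   # hypothetical stand-in for your trained model
    def __init__(self, vocab_size, embed_dim=50, hidden_size=32):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, vocab_size)

    def forward(self, x):
        out, _ = self.lstm(self.embedding(x))
        return self.head(out)

@torch.no_grad()
def complete(model, token_ids, n_new):
    """Greedily append `n_new` tokens to the list `token_ids`."""
    model.eval()
    token_ids = list(token_ids)
    for _ in range(n_new):
        logits = model(torch.tensor([token_ids]))    # (1, time, vocab)
        next_id = logits[0, -1].argmax().item()      # most likely next word
        token_ids.append(next_id)
    return token_ids

toy_vocab = ['<UNK>', 'the', 'movie', 'was', 'really', 'good']
lm = TinyLanguageModel(len(toy_vocab))
completed = complete(lm, [1, 2, 3], n_new=4)
print(' '.join(toy_vocab[i] for i in completed))
```

Greedy decoding tends to repeat itself; sampling the next word from the softmax of the logits is a possible alternative.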

## Transfer learning
### Exercise 3 (optional)

Discard the last layer of the now trained language model and use it to train on the original IMDB-problem.
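
The idea can be sketched as follows, assuming a language model with `embedding`, `lstm` and `head` attributes. Both class names and all sizes are hypothetical, the "pretrained" model here is in fact untrained, and the labels are assumed to be 0 or 1.

```python
import torch
from torch import nn

class TinyLanguageModel(nn.Module):   # hypothetical pretrained language model
    def __init__(self, vocab_size, embed_dim=50, hidden_size=32):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, vocab_size)

class SentimentClassifier(nn.Module):   # hypothetical
    def __init__(self, language_model, hidden_size=32):
        super().__init__()
        # Reuse the pretrained layers; the word-prediction head is dropped
        self.embedding = language_model.embedding
        self.lstm = language_model.lstm
        self.head = nn.Linear(hidden_size, 1)    # new, untrained output layer

    def forward(self, x):
        _, (h, _) = self.lstm(self.embedding(x))
        return self.head(h[-1]).squeeze(-1)      # one logit per review

lm = TinyLanguageModel(vocab_size=6)     # pretend this has been trained
clf = SentimentClassifier(lm)

reviews = torch.tensor([[1, 2, 3, 4], [1, 2, 3, 5]])   # two toy reviews
labels = torch.tensor([0.0, 1.0])                      # assumed 0/1 labels
logits = clf(reviews)
loss = nn.BCEWithLogitsLoss()(logits, labels)
print(logits.shape, loss.item())
```

Whether to fine-tune the reused layers or freeze them with `requires_grad_(False)` is a design choice worth experimenting with.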