# Natural Language Processing with RNNs & Attention

When Alan Turing imagined his famous Turing test in 1950, his objective was to evaluate a machine's ability to match human intelligence. He could have tested for many things, such as the ability to recognise cats in pictures, play chess, compose music, or escape a maze, but interestingly, he chose a linguistic task. More specifically, he devised a *chatbot* capable of fooling its interlocutor into thinking it was human. This test does have its weaknesses: a set of hardcoded rules can fool unsuspecting or naive humans (e.g., the machine could give vague predefined answers in response to some keywords; it could pretend that it is joking or drunk, to get a pass on its weirdest answers; or it could escape difficult questions by answering them with its own questions), & many aspects of human intelligence are utterly ignored (e.g., the ability to interpret nonverbal communication such as facial expressions, or to learn a manual task). But the test does highlight the fact that mastering language is arguably *Homo sapiens*'s greatest cognitive ability. Can we build a machine that can read & write natural language?

A common approach for natural language tasks is to use recurrent neural networks. We will therefore continue to explore RNNs, starting with a *character* RNN, trained to predict the next character in a sentence. This will allow us to generate some original text, & in the process, we will see how to build a TensorFlow Dataset on a very long sequence. We will first use a *stateless* RNN (which learns on random portions of text at each iteration, without any information on the rest of the text), then we will build a *stateful* RNN (which preserves the hidden state between training iterations & continues reading where it left off, allowing it to learn longer patterns). Next, we will build an RNN to perform sentiment analysis (e.g., reading movie reviews & extracting the rater's feelings about the movie), this time treating sentences as sequences of words, rather than characters. Then we will show how RNNs can be used to build an encoder-decoder architecture capable of performing neural machine translation (NMT). For this, we will use the seq2seq API provided by the TensorFlow Addons project.

In the second part of this lesson, we will look at *attention* mechanisms. As their name suggests, these are neural network components that learn to select the part of the inputs that the rest of the model should focus on at each time step. First, we will see how to boost the performance of an RNN-based encoder-decoder architecture using attention, then we will drop RNNs altogether & look at a very successful attention-only architecture called the *transformer*. Finally, we will take a look at some of the most important advances in NLP, including incredibly powerful language models such as GPT-2 & BERT, both based on transformers.

We'll start with a simple & fun model that can write like Shakespeare (sort of).

---

# Generating Shakespearean Text Using a Character RNN

In a famous 2015 blog post titled "The Unreasonable Effectiveness of Recurrent Neural Networks", Andrej Karpathy showed how to train an RNN to predict the next character in a sentence. This *Char-RNN* can then be used to generate novel text, one character at a time. Here is a small sample of the text generated by a Char-RNN model after it was trained on all of Shakespeare's work:

    PANDARUS:
    Alas, I think he shall be come approached & the day
    When litle srain would be attain'd into being never fed,
    & who is but a chain & subjects of his death,
    I should not sleep.

Not exactly a masterpiece, but it is still impressive that the model was able to learn words, grammar, proper punctuation, & more, just by learning to predict the next character in a sentence. Let's look at how to build a Char-RNN, step by step, starting with the creation of the dataset.

## Creating the Training Dataset

First, let's download all of Shakespeare's work, using Keras's handy `get_file()` function & downloading the data from Andrej Karpathy's char-RNN project:

In [2]:
import tensorflow as tf
from tensorflow import keras

shakespeare_url = "https://homl.info/shakespeare"
file_path = keras.utils.get_file("shakespeare.txt", shakespeare_url)
with open(file_path) as f:
    shakespeare_text = f.read()

Downloading data from https://homl.info/shakespeare
1115394/1115394 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step


Next, we must encode every character as an integer. One option is to create a custom preprocessing layer. But in this case, it will be simpler to use Keras's `Tokenizer` class. First, we need to fit a tokenizer to the text: it will find all the characters used in the text & map each of them to a different character ID, from 1 to the number of distinct characters (it does not start at 0, so we can use that value for masking):

In [3]:
tokenizer = keras.preprocessing.text.Tokenizer(char_level = True)
tokenizer.fit_on_texts([shakespeare_text])

We set `char_level = True` to get character-level encoding rather than the default word-level encoding. Note that this tokenizer converts the text to lower case by default (but you can set `lower = False` if you do not want that). Now the tokenizer can encode a sentence (or a list of sentences) to a list of character IDs & back, & it tells us how many distinct characters there are & the total number of characters in the text:

In [4]:
tokenizer.texts_to_sequences(["First"])

[[20, 6, 9, 8, 3]]

In [5]:
tokenizer.sequences_to_texts([[20, 6, 9, 8, 3]])

['f i r s t']

In [6]:
max_id = len(tokenizer.word_index) # number of distinct characters
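# document_count counts the texts passed to fit_on_texts() (just 1 here), so we
# sum the per-character counts instead to get the total number of characters
dataset_size = sum(tokenizer.word_counts.values())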
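To make the mapping concrete, here is a plain-Python sketch of roughly what a character-level tokenizer does. The helper names (`build_char_index`, `encode`, `decode`) are hypothetical and not part of Keras. Giving the lowest IDs to the most frequent characters is an assumption of this sketch.

```python
from collections import Counter

def build_char_index(text, lower=True):
    """Map each distinct character to an ID starting at 1 (0 stays free for masking)."""
    if lower:
        text = text.lower()
    counts = Counter(text)
    # most_common() orders by descending count; ties keep first-seen order
    return {ch: i for i, (ch, _) in enumerate(counts.most_common(), start=1)}

def encode(text, char_index, lower=True):
    """Convert a string to a list of character IDs."""
    if lower:
        text = text.lower()
    return [char_index[ch] for ch in text]

def decode(ids, char_index):
    """Convert a list of character IDs back to a (lowercased) string."""
    index_char = {i: ch for ch, i in char_index.items()}
    return "".join(index_char[i] for i in ids)

sample = "To be, or not to be"          # toy text, for illustration only
char_index = build_char_index(sample)
ids = encode("To be", char_index)
print(ids)
print(decode(ids, char_index))          # to be
```

Because the text is lowercased before encoding, decoding gives back the lowercase version of the input rather than the original string.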

Let's encode the full text so each character is represented by its ID (we subtract 1 to get IDs from 0 to 38, rather than from 1 to 39):

In [8]:
import numpy as np

[encoded] = np.array(tokenizer.texts_to_sequences([shakespeare_text])) - 1

Before we continue, we need to split the dataset into a training set, a validation set, & a test set. We can't just shuffle all the characters in the text, so how do you split a sequential dataset?

## How to Split a Sequential Dataset

It is very important to avoid any overlap between the training set, the validation set, & the test set. For example, we can take the first 90% of the text for the training set, then the next 5% for the validation set, & the final 5% for the test set. It would also be a good idea to leave a gap between these sets to avoid the risk of a paragraph straddling two sets.

When dealing with time series, you would in general split across time: for example, you might take years 2000 to 2012 for the training set, the years 2013 to 2015 for the validation set, & the years 2016 to 2018 for the test set. However, in some cases, you may be able to split along other dimensions, which will give you a longer time period to train on. For example, if you have data about the financial health of 10,000 companies from 2000 to 2018, you might be able to split this data across the different companies. It's very likely that many of these companies will be strongly correlated, though (e.g., whole economic sectors may go up or down jointly), & if you have correlated companies across the training set & the test set, your test set will not be as useful, as its measure of the generalisation error will be optimistically biased.

So, it is often safer to split across time -- but this implicitly assumes that the patterns the RNN can learn in the past (in the training set) will still exist in the future. In other words, we assume that the time series is *stationary* (at least in a wide sense). For many time series, this assumption is reasonable (e.g., chemical reactions should be fine, since the laws of chemistry don't change every day), but for many others it is not (e.g., financial markets are notoriously not stationary since patterns disappear as soon as traders spot them & start exploiting them). To make sure the time series is indeed sufficiently stationary, you can plot the model's errors on the validation set across time: if the model performs much better on the first part of the validation set than on the last part, then the time series may not be stationary enough, & you might be better off training the model on a shorter time span.
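One way to approximate this check without a plot is to compare the average error over consecutive chunks of the validation period. The sketch below is only an illustration: `error_by_segment` is a hypothetical helper, the numbers are made up, & mean absolute error is just one possible choice of metric.

```python
def error_by_segment(y_true, y_pred, n_segments=4):
    """Mean absolute error over consecutive chunks of the validation period."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    seg_len = len(y_true) // n_segments
    if seg_len == 0:
        raise ValueError("not enough time steps for that many segments")
    errors = []
    for k in range(n_segments):
        start = k * seg_len
        # the last segment absorbs any leftover time steps
        end = len(y_true) if k == n_segments - 1 else start + seg_len
        abs_errs = [abs(t - p) for t, p in zip(y_true[start:end], y_pred[start:end])]
        errors.append(sum(abs_errs) / len(abs_errs))
    return errors

# Toy values (made up): the forecasts drift away from the truth over time
y_true = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0]
y_pred = [10.0, 11.0, 12.5, 13.5, 15.0, 16.0, 18.0, 19.0]
print(error_by_segment(y_true, y_pred, n_segments=4))  # [0.0, 0.5, 1.0, 2.0]
```

An error that grows steadily from the first segment to the last, as in this toy case, is the kind of pattern that suggests the series may not be stationary enough.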

In short, splitting a time series into a training set, a validation set, & a test set is not a trivial task, & how it's done will depend strongly on the task at hand.

Now back to Shakespeare. Let's take the first 90% of the text for the training set (keeping the rest for the validation set & the test set), & create a `tf.data.Dataset` that will return each character one by one from this set:

In [9]:
train_size = dataset_size * 90 // 100
dataset = tf.data.Dataset.from_tensor_slices(encoded[:train_size])
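For reference, the full three-way split with a gap between the sets, as described earlier in this section, might be sketched as follows. `split_sequence` & its parameters are hypothetical names chosen for this illustration, & the gap size is arbitrary.

```python
def split_sequence(seq, train_pct=90, valid_pct=5, gap=0):
    """Split a sequence into train/validation/test parts, in order, without shuffling.

    `gap` elements are skipped after the training set and after the validation
    set, so that no paragraph can straddle two sets.
    """
    n = len(seq)
    train_end = n * train_pct // 100
    valid_end = n * (train_pct + valid_pct) // 100
    train = seq[:train_end]
    valid = seq[train_end + gap:valid_end]
    test = seq[valid_end + gap:]
    return train, valid, test

toy = list(range(1000))  # stand-in for the encoded text
train, valid, test = split_sequence(toy, gap=10)
print(len(train), len(valid), len(test))  # 900 40 40
```

The gap is taken out of the validation & test portions here, so the training set keeps its full 90%.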

## Chopping the Sequential Dataset into Multiple Windows

The training set now consists of a single sequence of over a million characters, so we can't just train the neural network directly on it: the RNN would be equivalent to a deep net with over a million layers & we would have a single (very long) instance to train it. Instead, we will use the dataset's `window()` method to convert this long sequence of characters into many smaller windows of text. Every instance in the dataset will be a fairly short substring of the whole text, & the RNN will be unrolled only over the length of these substrings. This is called *truncated backpropagation through time*. Let's call the `window()` method to create a dataset of short text windows.

In [11]:
n_steps = 100
window_length = n_steps + 1
dataset = dataset.window(window_length, shift = 1, drop_remainder = True)

By default, the `window()` method creates nonoverlapping windows, but to get the largest possible training set we use `shift = 1` so that the first window contains characters 0 to 100, the second contains characters 1 to 101, & so on. To ensure that all windows are exactly 101 characters long (which will allow us to create batches without having to do any padding), we set `drop_remainder = True` (otherwise the last 100 windows will contain 100 characters, 99 characters, & so on down to 1 character).
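The effect of `shift` & `drop_remainder` is easier to see on a tiny sequence. The following plain-Python sketch imitates the behaviour described above. `sliding_windows` is a hypothetical helper, not the way `tf.data` is implemented.

```python
def sliding_windows(seq, size, shift=1, drop_remainder=False):
    """Return windows of `size` items whose start positions are `shift` apart."""
    windows = []
    for start in range(0, len(seq), shift):
        window = seq[start:start + size]
        if drop_remainder and len(window) < size:
            break  # every later window would be too short as well
        windows.append(window)
    return windows

text = "abcde"
print(sliding_windows(text, size=3, shift=1, drop_remainder=True))
# ['abc', 'bcd', 'cde']
print(sliding_windows(text, size=3, shift=1))
# ['abc', 'bcd', 'cde', 'de', 'e']
```

With a window size of 3, the last 2 windows are short unless they are dropped, just as the last 100 windows would be with a window size of 101.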

The `window()` method creates a dataset that contains windows, each of which is also represented as a dataset. It's a *nested dataset* analogous to a list of lists. This is useful when you want to transform each window by calling its dataset methods (e.g., to shuffle them or batch them). However, we cannot use a nested dataset directly for training, as our model will expect tensors as input, not datasets. So we must call the `flat_map()` method: it converts a nested dataset into a *flat dataset* (one that does not contain datasets). For example, suppose {1, 2, 3} represented a dataset containing the sequence of tensors 1, 2, & 3. If you flatten the nested dataset {{1, 2}, {3, 4, 5, 6}}, you will get back the flat dataset {1, 2, 3, 4, 5, 6}. Moreover, the `flat_map()` method takes a function as an argument, which allows you to transform each dataset in the nested dataset before flattening. For example, if you pass the function `lambda ds: ds.batch(2)` to `flat_map()`, then it will transform the nested dataset {{1, 2}, {3, 4, 5, 6}} into the flat dataset {[1, 2], [3, 4], [5, 6]}: it's a dataset of tensors of size 2. With that in mind, we are ready to flatten our dataset:

In [12]:
dataset = dataset.flat_map(lambda window: window.batch(window_length))
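The flattening examples given above can be reproduced with ordinary lists, using lists of lists in place of nested datasets. The `batch` & `flat_map` functions below are hypothetical stand-ins written for this illustration; they only mimic the idea, not the `tf.data` API.

```python
def batch(seq, batch_size):
    """Group consecutive items into lists of at most `batch_size` items."""
    return [seq[i:i + batch_size] for i in range(0, len(seq), batch_size)]

def flat_map(nested, fn):
    """Apply `fn` to each inner sequence, then concatenate the results."""
    flat = []
    for inner in nested:
        flat.extend(fn(inner))
    return flat

nested = [[1, 2], [3, 4, 5, 6]]
print(flat_map(nested, lambda ds: ds))            # [1, 2, 3, 4, 5, 6]
print(flat_map(nested, lambda ds: batch(ds, 2)))  # [[1, 2], [3, 4], [5, 6]]

# Batching each window by its own length yields exactly one item per window
windows = [[0, 1, 2], [1, 2, 3]]
print(flat_map(windows, lambda w: batch(w, 3)))   # [[0, 1, 2], [1, 2, 3]]
```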

Notice that we call `batch(window_length)` on each window: since all windows have exactly that length, we will get a single tensor for each of them. Now the dataset contains consecutive windows of 101 characters each. Since gradient descent works best when the instances in the training set are independent & identically distributed, we need to shuffle these windows. Then we can batch the windows & separate the inputs (the first 100 characters) from the targets (the last 100 characters, i.e., the same sequence shifted one character ahead):

In [13]:
batch_size = 32
dataset = dataset.shuffle(10000).batch(batch_size)
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))
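The two slices in the last line can be tried out on plain strings. In this sketch, `split_inputs_targets` is a hypothetical helper & the windows are 11 characters long rather than 101. Each target sequence is its input sequence shifted one character ahead.

```python
def split_inputs_targets(windows):
    """Inputs drop the last character of each window; targets drop the first."""
    inputs = [w[:-1] for w in windows]
    targets = [w[1:] for w in windows]
    return inputs, targets

windows = ["to be or no", "o be or not"]  # two consecutive windows of length 11
inputs, targets = split_inputs_targets(windows)
print(inputs)   # ['to be or n', 'o be or no']
print(targets)  # ['o be or no', ' be or not']
```

At every position, the target is the character that follows the corresponding input character in the original text.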

The figure below summarises the dataset preparation steps so far (showing windows of length 11 rather than 101, & a batch size of 3 instead of 32).

<img src = "Images/Shuffled Windows.png" width = "600" style = "margin:auto"/>

As discussed in previous lessons, categorical input features should generally be encoded, usually as one-hot vectors or as embeddings. Here, we will encode each character using a one-hot vector because there are fairly few distinct characters (only 39):

In [14]:
dataset = dataset.map(lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth = max_id), Y_batch))
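As a reminder of what one-hot encoding produces, here is a minimal plain-Python sketch. This `one_hot` function is a hypothetical helper written for illustration, using a depth of 3 rather than 39.

```python
def one_hot(ids, depth):
    """Turn each ID into a vector of length `depth` with a single 1.0."""
    vectors = []
    for i in ids:
        row = [0.0] * depth
        # design choice of this sketch: an out-of-range ID gives an all-zero vector
        if 0 <= i < depth:
            row[i] = 1.0
        vectors.append(row)
    return vectors

print(one_hot([0, 2, 1], depth=3))
# [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
```

Only the inputs are one-hot encoded in the cell above; the targets stay as integer IDs.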

Finally, we just need to add prefetching:

In [15]:
dataset = dataset.prefetch(1)

That's it! Preparing the dataset was the hardest part. Now let's create the model.

## Building & Training the Char-RNN Model