# Natural Language Processing with RNNs and Attention

A common approach for natural language tasks is to use recurrent neural networks. We will therefore continue to explore RNNs, starting with a *character RNN*, trained to predict the next character in a sentence. We will first use a *stateless RNN* (which learns on random portions of text at each iteration, without any information on the rest of the text), then we will build a *stateful RNN* (which preserves the hidden state between training iterations and continues reading where it left off, allowing it to learn longer patterns). Next, we will build an RNN to perform sentiment analysis (e.g., reading movie reviews and extracting the rater's feeling about the movie), this time treating sentences as sequences of words, rather than characters. Then we will show how RNNs can be used to build an Encoder-Decoder architecture capable of performing neural machine translation (NMT). 

In the second part of this chapter, we look at *attention mechanisms*. As their name suggests, these are neural network components that learn to select the part of the inputs that the rest of the model should focus on at each time step. First, we will see how to boost the performance of an RNN-based Encoder-Decoder architecture using attention, then we will drop RNNs altogether and look at a very successful attention-only architecture called the *Transformer*.

## Generating Shakespearean Text Using a Character RNN

Let's look at how to build a Char-RNN, step by step, starting with the creation of the dataset.

### Creating the Training Dataset

In [1]:
import tensorflow as tf

In [2]:
'''
First, let's download all of Shakespeare's work, using Keras' handy get_file() function
'''

shakespeare_url = 'https://homl.info/shakespeare'
filepath = tf.keras.utils.get_file('shakespeare.txt', shakespeare_url)
with open(filepath) as f:
    shakespeare_text = f.read()

Next, we must encode every character as an integer. In this case, it will be simpler to use Keras' Tokenizer class. First, we need to fit a tokenizer to the text: it will find all the characters used in the text and map each of them to a different character ID, from 1 to the number of distinct characters (it does not start at 0, so we can use that value for masking, as we will see later in this chapter).

In [3]:
tokenizer = tf.keras.preprocessing.text.Tokenizer(char_level=True)
tokenizer.fit_on_texts([shakespeare_text])

In [4]:
'''
We set char_level=True to get character-level encoding rather than the default word-level encoding
'''

max_id = len(tokenizer.word_index)
total_tokens = len(tokenizer.texts_to_sequences([shakespeare_text])[0])
tokenizer.texts_to_sequences(['First']), tokenizer.sequences_to_texts([[20, 6, 9 , 8, 3]]), f'Max ID: {max_id} | Total Characters: {total_tokens}'

([[20, 6, 9, 8, 3]], ['f i r s t'], 'Max ID: 39 | Total Characters: 1115394')
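To make the mapping concrete, here is a minimal plain-Python sketch of the same idea. It is not the Tokenizer's actual implementation: it simply lowercases the text, ranks characters by frequency, and assigns IDs starting at 1 so that 0 stays free for padding. The helper names `build_char_index`, `encode`, and `decode` are hypothetical.

```python
from collections import Counter

def build_char_index(text):
    """Map each distinct (lowercased) character to an ID starting at 1."""
    counts = Counter(text.lower())
    # Most frequent characters get the smallest IDs; ties are broken alphabetically.
    ranked = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return {ch: i for i, ch in enumerate(ranked, start=1)}

def encode(text, char_index):
    """Convert a string to a list of character IDs."""
    return [char_index[ch] for ch in text.lower()]

def decode(ids, char_index):
    """Convert a list of character IDs back to a string."""
    index_char = {i: ch for ch, i in char_index.items()}
    return "".join(index_char[i] for i in ids)

sample = "First Citizen: Before we proceed"
char_index = build_char_index(sample)
ids = encode("First", char_index)
print(ids, decode(ids, char_index))
```

Because the text is lowercased before encoding, decoding returns "first" rather than "First", which mirrors the round trip shown in the cell output above.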

In [5]:
import numpy as np
'''
Let's encode the full text so each character is represented by its ID.
We subtract 1 to get IDs from 0 to 38, rather than from 1 to 39
'''

[encoded] = np.array(tokenizer.texts_to_sequences([shakespeare_text])) - 1

Before we continue, we need to split the dataset into a training set, a validation set, and a test set. We can't just shuffle all the characters in the text, so how do you split a sequential dataset?

### How to Split a Sequential Dataset

When dealing with time series, you would in general split across time: for example, you might take the years 2000 to 2012 for the training set, the years 2013 to 2015 for the validation set, and the years 2016 to 2018 for the test set. However, in some cases you may be able to split along other dimensions, which will give you a longer time period to train on. For example, if you have data about the financial health of 10,000 companies from 2000 to 2018, you might be able to split this data across the different companies. It's very likely that many of these companies will be strongly correlated, though (e.g. whole economic sectors may go up or down jointly), and if you have correlated companies across the training set and the test set your test set will not be as useful, as its measure of the generalization error will be optimistically biased.

So, it is often safer to split across time - but this implicitly assumes that the patterns the RNN can learn in the past (in the training set) will still exist in the future. In other words, we assume that the time series is *stationary* (at least in a wide sense). For many time series this assumption is reasonable (e.g. chemical reactions should be fine, since the laws of chemistry don't change every day), but for many others it is not (e.g. financial markets are notoriously not stationary since patterns disappear as soon as traders spot them and start exploiting them). **To make sure the time series is indeed sufficiently stationary, you can plot the model's errors on the validation set across time: if the model performs much better on the first part of the validation set than on the last part, then the time series may not be stationary enough, and you might be better off training the model on a shorter time span.**

In short, splitting a time series into a training set, a validation set, and a test set is not a trivial task, and how it's done will depend strongly on the task at hand.
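As a rough illustration of the two ideas above, the following sketch splits a sequence chronologically and compares the average validation error on the first and second halves of the validation period. The function names, the 90/5/5 ratios, and the error values are assumptions made up for this example; a clearly positive drift would only be a hint that the series may not be stationary enough.

```python
def split_by_time(series, train_ratio=0.9, valid_ratio=0.05):
    """Split a sequence chronologically into train, validation, and test parts."""
    n = len(series)
    train_end = int(n * train_ratio)
    valid_end = train_end + int(n * valid_ratio)
    return series[:train_end], series[train_end:valid_end], series[valid_end:]

def error_drift(errors):
    """Mean error on the second half of the validation period minus the first half."""
    half = len(errors) // 2
    first, second = errors[:half], errors[half:]
    return sum(second) / len(second) - sum(first) / len(first)

series = list(range(100))
train, valid, test = split_by_time(series)

# Hypothetical per-step absolute errors on the validation set, in time order.
valid_errors = [0.10, 0.11, 0.09, 0.30, 0.35]
drift = error_drift(valid_errors)
print(len(train), len(valid), len(test), round(drift, 3))
```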

In [6]:
'''
For this Shakespeare example we take the first 90% of the text for the training set and use the rest for validation and test
'''

# Convert the text to an integer sequence
sequence = tokenizer.texts_to_sequences([shakespeare_text])[0]  # Flattened list of token IDs

# Define split ratios
train_ratio = 0.9
val_ratio = 0.05  # test will be the rest (0.05)

# Compute split indices
total_tokens = len(sequence)
train_end = int(total_tokens * train_ratio)
val_end = train_end + int(total_tokens * val_ratio)

#  Split the data
dataset = tf.data.Dataset.from_tensor_slices(encoded[:train_end])

### Chopping the Sequential Dataset into Multiple Windows

The training set now consists of a single sequence of over a million characters, so we can't just train the neural network directly on it: the RNN would be equivalent to a deep net with over a million layers, and we would have a single (very long) instance to train it. Instead, we will use the dataset's window() method to convert this long sequence of characters into many smaller windows of text. Every instance in the dataset will be a fairly short substring of the whole text, and the RNN will be unrolled only over the length of these substrings. **This is called *truncated backpropagation through time*.** Let's call the window() method to create a dataset of short text windows:

In [7]:
'''
You can try tuning n_steps: it is easier to train RNNs on shorter input sequences, but of course the RNN will not be able to learn any pattern longer than n_steps, so don't make it too small
'''
n_steps = 100
window_length = n_steps + 1
dataset = dataset.window(window_length, shift=1, drop_remainder=True)

**By default, the window() method creates nonoverlapping windows, but to get the largest possible training set we use shift=1 so that the first window contains characters 0 to 100, the second contains characters 1 to 101, and so on.** To ensure that all windows are exactly 101 characters long (which will allow us to create batches without having to do any padding), we set drop_remainder=True (otherwise the last 100 windows will contain 100 characters, 99 characters, and so on down to 1 character).
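The windowing behavior described above can be sketched in plain Python. This is only an illustration of the semantics, not how tf.data implements it, and the generator name `windows` is hypothetical.

```python
def windows(sequence, length, shift=1, drop_remainder=True):
    """Yield successive windows of `length` items, moving `shift` items each time."""
    for start in range(0, len(sequence), shift):
        window = sequence[start:start + length]
        if drop_remainder and len(window) < length:
            break  # every later window would be too short as well
        yield window

text = list(range(10))
full = list(windows(text, length=4, shift=1))
ragged = list(windows(text, length=4, shift=1, drop_remainder=False))
print(len(full), len(ragged), ragged[-1])
```

With a window length of 4, dropping the remainder removes the last 3 windows (of lengths 3, 2, and 1), just as it removes the last 100 short windows when the window length is 101.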

The window() method creates a dataset that contains windows, each of which is also represented as a dataset. It's a *nested dataset*, analogous to a list of lists. This is useful when you want to transform each window by calling its dataset methods (e.g. to shuffle them or batch them). **However, we cannot use a nested dataset directly for training, as our model will expect tensors as input, not datasets. So, we must call the flat_map() method: it converts a nested dataset into a *flat dataset*.**

Moreover, the flat_map() method takes a function as an argument, which allows you to transform each dataset in the nested dataset before flattening. For example, if you pass the function lambda ds: ds.batch(2) to flat_map(), then it will transform the nested dataset {{1, 2}, {3, 4, 5, 6}} into the flat dataset {[1, 2], [3, 4], [5, 6]}: it's a dataset of tensors of size 2. With that in mind, we are ready to flatten our dataset.
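The small example above can be reproduced with lists standing in for datasets. The helpers `batch` and `flat_map` below are hypothetical stand-ins that only mimic the behavior described in the text.

```python
def batch(items, size):
    """Group items into consecutive lists of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def flat_map(nested, fn):
    """Apply `fn` to each inner collection, then concatenate the results."""
    return [element for inner in nested for element in fn(inner)]

nested = [[1, 2], [3, 4, 5, 6]]
flat = flat_map(nested, lambda ds: batch(ds, 2))
print(flat)
```

When every inner collection has exactly the batch size, as is the case for our windows, each one becomes a single element of the flat result.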

In [8]:
dataset = dataset.flat_map(lambda window: window.batch(window_length))

Notice that we call batch(window_length) on each window: since all windows have exactly that length, we will get a single tensor for each of them. Since Gradient Descent works best when the instances in the training set are independent and identically distributed, we need to shuffle these windows. Then we can batch the windows and separate the inputs (the first 100 characters) from the targets (the last 100 characters, i.e., the same window shifted one character ahead, so that at every time step the target is the next character).

In [9]:
batch_size = 256
dataset = dataset.shuffle(10000).batch(batch_size)
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))
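The slicing in the cell above pairs each input character with the character that follows it. A tiny plain-Python sketch, using a hypothetical helper `split_inputs_targets` and a five-character window instead of 101, makes the alignment visible.

```python
def split_inputs_targets(window):
    """Inputs are all items but the last; targets are the same items shifted by one."""
    return window[:-1], window[1:]

inputs, targets = split_inputs_targets(list("to be"))
for x, y in zip(inputs, targets):
    print(repr(x), "->", repr(y))
```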

In [10]:
'''
Categorical input features should generally be encoded, usually as one-hot vectors or embeddings. Here, we will encode each character using a one-hot vector (because there are only 39 distinct characters).
'''

dataset = dataset.map(lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch))

# Finally, add prefetching
dataset = dataset.prefetch(1)

### Building and Training the Char-RNN Model

Preparing the dataset was the hardest part. Now let's create the model. We can use an RNN with 2 GRU layers of 128 units each and a 20% dropout on both the inputs (dropout) and hidden states (recurrent_dropout). The output layer is a time-distributed Dense layer. This time this layer must have 39 units because there are 39 distinct characters in the text, and we want to output a probability for each possible character. We apply the softmax activation function to the outputs of the Dense layer. We can then compile this model, using the 'sparse_categorical_crossentropy' loss and an Adam optimizer.

In [11]:
# This cell takes several hours to run 20 epochs

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, GRU, TimeDistributed, Dense

model = Sequential([
    Input([None, max_id]),
    GRU(128, return_sequences=True, dropout=0.2, recurrent_dropout=0.2),
    GRU(128, return_sequences=True, dropout=0.2, recurrent_dropout=0.2),
    TimeDistributed(Dense(max_id, activation='softmax'))
])

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
history = model.fit(dataset, epochs=2)

Epoch 1/2
3921/3921 ━━━━━━━━━━━━━━━━━━━━ 3125s 796ms/step - loss: 1.7444
Epoch 2/2
3921/3921 ━━━━━━━━━━━━━━━━━━━━ 3152s 804ms/step - loss: 1.5191


### Using the Char-RNN Model

Now we have a model that can predict the next character in text written by Shakespeare. To feed it some text, we first need to preprocess it like we did earlier, so let's create a little function for this:

In [12]:
def preprocess(texts):
    X = np.array(tokenizer.texts_to_sequences(texts)) - 1
    return tf.one_hot(X, max_id)

# Now predict: pick the most likely class at each time step, then add 1 to get
# back to the tokenizer's IDs (which start at 1)
X_new = preprocess(['How are yo'])
Y_pred = np.argmax(model.predict(X_new), axis=-1)
tokenizer.sequences_to_texts(Y_pred + 1)[0][-1]  # the last character is the predicted next character

### Generating Fake Shakespearean Text

To generate new text using the Char-RNN model, we could feed it some text, make the model predict the most likely next letter, add it at the end of the text, then give the extended text to the model to guess the next letter, and so on. But in practice, this often leads to the same words being repeated over and over again. Instead, we can pick the next character randomly, with a probability equal to the estimated probability, using TensorFlow's tf.random.categorical() function. 

The categorical() function samples random class indices, given the class log probabilities (logits). To have more control over the diversity of the generated text, we can divide the logits by a number called the *temperature*, which we can tweak as we wish: a temperature close to 0 will favor the high-probability characters, while a very high temperature will give all characters an equal probability. 
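To see what the temperature does to a distribution, here is a small stdlib-only sketch. It assumes the usual interpretation: divide the log probabilities by the temperature, then renormalize with a softmax before sampling. The function names and the three-class probabilities are made up for illustration.

```python
import math
import random

def apply_temperature(probas, temperature):
    """Rescale log-probabilities by 1/temperature and renormalize with a softmax."""
    logits = [math.log(p) / temperature for p in probas]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(probas, temperature=1.0, rng=random):
    """Draw one class index from the temperature-adjusted distribution."""
    adjusted = apply_temperature(probas, temperature)
    return rng.choices(range(len(adjusted)), weights=adjusted, k=1)[0]

probas = [0.6, 0.3, 0.1]  # hypothetical next-character probabilities
cold = apply_temperature(probas, 0.2)
hot = apply_temperature(probas, 5.0)
print([round(p, 3) for p in cold], [round(p, 3) for p in hot])
print(sample_index(probas, temperature=0.2, rng=random.Random(42)))
```

A low temperature concentrates almost all of the probability mass on the most likely class, while a high temperature pushes the three probabilities toward one another.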

In [13]:
def next_char(text, temperature=1):
    X_new = preprocess([text])
    y_proba = model.predict(X_new)[0, -1:, :]
    rescaled_logits = tf.math.log(y_proba) / temperature
    char_id = tf.random.categorical(rescaled_logits, num_samples=1) + 1
    return tokenizer.sequences_to_texts(char_id.numpy())[0]

def complete_text(text, n_chars=50, temperature=1):
    for _ in range(n_chars):
        text += next_char(text, temperature)
    return text

complete_text('t', temperature=0.2)

'this fair for the rest,\nand we will not be so will '

In [14]:
complete_text('w', temperature=2)

"w'dined.\nshe gold. i'll your yield?\nbeyech! scyiath"

**To generate more convincing text**, you could try using more GRU layers and more neurons per layer, train for longer, and add some regularization (for example, you could set recurrent_dropout=0.3 in GRU layers). Moreover, the model is currently incapable of learning patterns longer than n_steps, which is just 100 characters. You could try making this window larger, but it will also make training harder, and even LSTM and GRU cells cannot handle very long sequences. Alternatively, **you could use a stateful RNN**.

### Stateful RNN

Until now, we have used only *stateless RNNs*: at each training iteration the model starts with a hidden state full of zeros, then it updates this state at each time step, and after the last time step, it throws it away, as it is not needed anymore. What if we told the RNN to preserve this final state after processing one training batch and use it as the initial state for the next training batch? This way the model can learn long-term patterns despite only backpropagating through short sequences. This is called a *stateful RNN*. 

**First, note that a stateful RNN only makes sense if each input sequence in a batch starts exactly where the corresponding sequence in the previous batch left off.** So the first thing we need to do to build a stateful RNN is to use sequential and nonoverlapping input sequences (rather than the shuffled and overlapping sequences we used to train stateless RNNs).

**Unfortunately, batching is much harder when preparing a dataset for a stateful RNN than it is for a stateless RNN.** Indeed, if we were to call batch(32), then 32 consecutive windows would be put in the same batch, and the following batch would not continue each of these windows where it left off. **The simplest solution to this problem is to just use "batches" containing a single window.**

Batching is harder, but it is not impossible. For example, we could chop Shakespeare's text into 32 texts of equal length, create one dataset of consecutive input sequences for each of them, and finally use tf.data.Dataset.zip(datasets).map(lambda *windows: tf.stack(windows)) to create proper consecutive batches, where the nth input sequence in a batch starts off exactly where the nth input sequence ended in the previous batch.
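The chop-and-zip idea can be sketched without TensorFlow. The code below uses a batch size of 2 and very short sequences so the result is easy to inspect; the function names are hypothetical, and the real pipeline would use tf.data datasets rather than lists.

```python
def consecutive_windows(seq, n_steps):
    """Non-overlapping inputs of n_steps items, each paired with targets shifted by one."""
    window_length = n_steps + 1
    result = []
    for start in range(0, len(seq) - window_length + 1, n_steps):
        window = seq[start:start + window_length]
        result.append((window[:-1], window[1:]))
    return result

def stateful_batches(seq, batch_size, n_steps):
    """Chop seq into batch_size equal parts and zip their consecutive windows."""
    part_length = len(seq) // batch_size
    parts = [seq[i * part_length:(i + 1) * part_length] for i in range(batch_size)]
    per_part = [consecutive_windows(part, n_steps) for part in parts]
    # zip(*per_part) puts the k-th window of every part into the k-th batch.
    return [list(batch) for batch in zip(*per_part)]

batches = stateful_batches(list(range(40)), batch_size=2, n_steps=3)
for batch in batches[:2]:
    print([inputs for inputs, _ in batch])
```

In each batch, the first sequence comes from the first half of the data and the second from the second half, and each one picks up right after the corresponding sequence of the previous batch.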

In [15]:
dataset = tf.data.Dataset.from_tensor_slices(encoded[:train_end])
dataset = dataset.window(window_length, shift=n_steps, drop_remainder=True)
dataset = dataset.flat_map(lambda window: window.batch(window_length))
dataset = dataset.batch(1)
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))
dataset = dataset.map(lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch))
dataset = dataset.prefetch(1)

Now let's create a stateful RNN. First, we need to set stateful=True when creating every recurrent layer. Second, the stateful RNN needs to know the batch size (since it will preserve a state for each input sequence in the batch), so we must set the batch_input_shape argument in the first layer (or pass batch_shape to an Input layer, as the code below does). Note that we can leave the second dimension unspecified, since the inputs could have any length.

At the end of each epoch, we need to reset the states before we go back to the beginning of the text. For this, we can use a small callback.

**After this model is trained, it will only be possible to use it to make predictions for batches of the same size as were used during training. To avoid this restriction, create an identical *stateless* model, and copy the stateful model's weights to this model.**

In [17]:
'''This cell takes an absolutely absurd amount of time. Run at your own risk.'''

# model = Sequential([
#     Input(batch_shape=(1, window_length - 1, max_id)),
#     GRU(128, return_sequences=True, stateful=True, dropout=0.2, recurrent_dropout=0.0),
#     GRU(128, return_sequences=True, stateful=True, dropout=0.2, recurrent_dropout=0.0),
#     TimeDistributed(Dense(max_id, activation='softmax'))
# ])

# from tensorflow.keras.callbacks import Callback

# class ResetStatesCallback(Callback):
#     def on_epoch_begin(self, epoch, logs=None):
#         for layer in self.model.layers:
#             if hasattr(layer, "reset_states"):
#                 layer.reset_states()


# model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', run_eagerly=True)
# model.fit(dataset, epochs=1, callbacks=[ResetStatesCallback()])

'This cell takes an absolutely absurd amount of time. Run at your own risk.'

## Sentiment Analysis

The IMDb reviews dataset is the "hello world" of natural language processing, and it is popular for good reasons: it is simple enough to be tackled in a short amount of time, but challenging enough to be fun and rewarding. Keras provides a simple function to load it.

The dataset is already preprocessed for you: **X_train consists of a list of reviews, each of which is represented as a NumPy array of integers, where each integer represents a word. All punctuation was removed, and then words were converted to lowercase, split by spaces, and finally indexed by frequency (so low integers correspond to frequent words).** The integers 0, 1, and 2 are special: they represent the padding token, the *start-of-sequence* (SOS) token, and unknown words, respectively. If you want to visualize a review, you can decode it like in the example below.

In a real project, you will have to preprocess the text yourself. You can do that using the same Tokenizer class we used earlier, this time leaving char_level at its default value of False. When encoding words, it filters out a lot of characters, including most punctuation, line breaks, and tabs (but you can change this by setting the ***filters*** argument). Most importantly, it uses spaces to identify word boundaries. This is OK for English and many other scripts that use spaces between words, but not all scripts use spaces this way. Chinese does not use spaces between words, Vietnamese uses spaces even within words, and languages such as German often attach multiple words together, without spaces.

Fortunately, there are better options! The 2018 paper by Taku Kudo "Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates" introduced an unsupervised learning technique to tokenize and detokenize text at the subword level in a language-independent way, treating spaces like other characters. With this approach, even if your model encounters a word it has never seen before, it can still reasonably guess what it means. Google's *SentencePiece* project provides an open source implementation of this paper.

Last but not least, the TensorFlow team released the TF.Text library in June 2019, which implements various tokenization strategies, including WordPiece (a variant of byte pair encoding).

If you want to deploy your model to a mobile device or a web browser, and you don't want to have to write a different preprocessing function every time, then you will want to handle preprocessing using only TensorFlow operations, so it can be included in the model itself. An example of this is shown below.

TF Transform (introduced in Chapter 13) provides some useful functions to handle such vocabularies. For example, check out the tft.compute_and_apply_vocabulary() function: it will go through the dataset to find all distinct words and build the vocabulary, and it will generate the TF operations required to encode each word using this vocabulary.

Now we are ready to create the final training set.

In [18]:
'''Load the IMDb dataset'''
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.imdb.load_data()
X_train[0][:10]

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65]

In [19]:
'''Decode a Review'''
word_index = tf.keras.datasets.imdb.get_word_index()
id_to_word = {id_ + 3: word for word, id_ in word_index.items()}

for id_, token in enumerate(('<pad>', '<sos>', '<unk>')):
    id_to_word[id_] = token

" ".join([id_to_word[id_] for id_ in X_train[0][:10]])

'<sos> this film was just brilliant casting location scenery story'
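For intuition, the encoding scheme described earlier (lowercase, strip punctuation, split on spaces, rank words by frequency, reserve IDs 0 to 2) can be sketched on a toy corpus. This is not the code that produced the Keras dataset: the three reviews and all helper names are made up, and the offset of 3 simply mirrors the `id_ + 3` shift used in the decoding cell above.

```python
import re
from collections import Counter

PAD, SOS, UNK = 0, 1, 2  # reserved IDs: padding, start-of-sequence, unknown word
INDEX_FROM = 3           # frequency ranks (starting at 1) are shifted by this offset

def tokenize(review):
    """Lowercase, drop everything but letters, apostrophes, and spaces, then split."""
    return re.sub(r"[^a-z' ]", " ", review.lower()).split()

def build_word_index(reviews):
    """Rank words by frequency (ties broken alphabetically) and shift the ranks."""
    counts = Counter(word for review in reviews for word in tokenize(review))
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank + INDEX_FROM for rank, word in enumerate(ranked, start=1)}

def encode_review(review, word_index):
    """Prefix the review with SOS and map unseen words to UNK."""
    return [SOS] + [word_index.get(word, UNK) for word in tokenize(review)]

reviews = ["This film was brilliant!", "This film was not good.", "Brilliant cast."]
word_index = build_word_index(reviews)
print(encode_review("This film was awful", word_index))
```

The word "awful" never appears in the toy corpus, so it is encoded with the unknown-word ID.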

In [20]:
'''Preprocessing Using Only TensorFlow Operations'''
import tensorflow_datasets as tfds
import tempfile, shutil
from tensorflow.keras.layers import Embedding

# Temp Directory
tmp_dir = tempfile.mkdtemp()

# Load the dataset and define the training partition
try:
    datasets, info = tfds.load(
        "imdb_reviews/plain_text",
        as_supervised=True,
        with_info=True,
        data_dir=tmp_dir,
    )
    train_size = info.splits['train'].num_examples

    # Write the preprocessing function
    def preprocess(X_batch, y_batch):
        X_batch = tf.strings.substr(X_batch, 0, 300)
        X_batch = tf.strings.regex_replace(X_batch, b"<br\\s*/?>", b" ") # Replace HTML line-break tags such as <br>, <br/>, and <br /> with a space
        X_batch = tf.strings.regex_replace(X_batch, b"[^a-zA-Z']", b" ") # Strip non-word characters while keeping contractions
        X_batch = tf.strings.split(X_batch) # Split by spaces
        return X_batch.to_tensor(default_value=b"<pad>"), y_batch # Pad all reviews to ensure they're the same length
    
    # Construct the vocabulary
    from collections import Counter
    vocabulary = Counter()
    for X_batch, y_batch in datasets['train'].batch(32).map(preprocess):
        for review in X_batch:
            vocabulary.update(list(review.numpy()))
    
    print(f'Most common words: {vocabulary.most_common()[:3]}')
    
    # Truncate the vocabulary to the 10,000 most common words
    vocab_size = 10_000
    truncated_vocabulary = [word for word, count in vocabulary.most_common()[:vocab_size]]
    
    # Replace each word with its ID / vocabulary index
    words = tf.constant(truncated_vocabulary)
    word_ids = tf.range(len(truncated_vocabulary), dtype=tf.int64)
    vocab_init = tf.lookup.KeyValueTensorInitializer(words, word_ids)
    num_oov_buckets = 1_000
    table = tf.lookup.StaticVocabularyTable(vocab_init, num_oov_buckets)

    # Define and encode words
    def encode_words(X_batch, y_batch):
        return table.lookup(X_batch), y_batch

    train_set = datasets['train'].batch(32).map(preprocess)
    train_set = train_set.map(encode_words).prefetch(1)

    # Build and train model
    embed_size = 128
    model = Sequential([
        Input([None]),
        Embedding(vocab_size + num_oov_buckets, embed_size),
        GRU(128, return_sequences=True),
        GRU(128),
        Dense(1, activation='sigmoid')
    ])
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    history = model.fit(train_set, epochs=1)
    
finally:
    shutil.rmtree(tmp_dir, ignore_errors=True)

Most common words: [(b'<pad>', 214309), (b'the', 61137), (b'a', 38564)]
782/782 ━━━━━━━━━━━━━━━━━━━━ 44s 52ms/step - accuracy: 0.5228 - loss: 0.6912


### Masking

As it stands, the model will need to learn that the padding tokens should be ignored. Why don't we tell the model to ignore the padding tokens, so that it can focus on the data that actually matters? It's actually quite trivial: simply add mask_zero=True when creating the Embedding layer. This means that padding tokens (whose ID is 0) will be ignored by all downstream layers.

The way this works is that the Embedding layer creates a *mask tensor* equal to K.not_equal(inputs, 0) (where K = keras.backend): it is a Boolean tensor with the same shape as the inputs, and it is equal to False anywhere the word IDs are 0, or True otherwise.

Each layer may handle the mask differently, but in general they simply ignore masked time steps (i.e. time steps for which the mask is False). The LSTM and GRU layers have an optimized implementation for GPUs, based on Nvidia's cuDNN library. However, this implementation does not support masking. If your model uses a mask, then these layers will fall back to the (much slower) default implementation. Note that the optimized implementation also requires you to use the default values for several hyperparameters: activation, recurrent_activation, recurrent_dropout, unroll, use_bias, and reset_after.

All layers that receive the mask must support masking. Any layer that supports masking must have a supports_masking attribute equal to True. If you want to implement your own custom layer with masking support, you should add a mask argument to the call() method (and obviously make the method use the mask somehow). Additionally, you should set self.supports_masking=True in the constructor. If your model does not start with an Embedding layer, you may use the keras.layers.Masking layer instead: it sets the mask to K.any(K.not_equal(inputs, 0), axis=-1), meaning that time steps where the last dimension is full of zeros will be masked out in subsequent layers.
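The two mask definitions above can be checked on small nested lists. The sketch below is plain Python rather than Keras code, and the helper names are hypothetical; `masked_mean` is just one simple way a layer could ignore masked time steps, included as an assumption for illustration.

```python
def embedding_mask(word_ids):
    """True where a real word is present, False where the ID is the padding ID 0."""
    return [[word_id != 0 for word_id in row] for row in word_ids]

def masking_layer_mask(inputs):
    """True for time steps whose feature vector has at least one nonzero value."""
    return [[any(value != 0 for value in step) for step in sequence] for sequence in inputs]

def masked_mean(values, mask):
    """Average only the values at unmasked (True) positions."""
    kept = [v for v, keep in zip(values, mask) if keep]
    return sum(kept) / len(kept)

batch_ids = [[5, 8, 3, 0, 0], [7, 2, 0, 0, 0]]
mask = embedding_mask(batch_ids)
features = [[[0.5, 0.1], [0.0, 0.0]]]  # one sequence, two time steps, two features
print(mask[1])
print(masking_layer_mask(features))
print(masked_mean([1.0, 3.0, 9.0, 9.0, 9.0], mask[1]))
```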

Using masking layers and automatic mask propagation works best for simple Sequential models. It will not always work for more complex models, such as when you need to mix Conv1D layers with recurrent layers. In such cases, you will need to explicitly compute the mask and pass it to the appropriate layers, using either the Functional API or the Subclassing API. For example, the following model is identical to the previous model, except it is built using the Functional API and handles masking manually.

If all positive words and all negative words form clusters, then this will be helpful for sentiment analysis. So instead of using so many parameters to learn word embeddings, let's see if we can't just reuse pretrained embeddings.

In [21]:
from tensorflow.keras.layers import Lambda
from tensorflow.keras import Model

K = tf.keras.backend
inputs = Input(shape=[None])
mask = Lambda(lambda inputs: K.not_equal(inputs, 0))(inputs)
z = Embedding(vocab_size + num_oov_buckets, embed_size)(inputs)
z = GRU(128, return_sequences=True)(z, mask=mask)
z = GRU(128)(z, mask=mask)
outputs = Dense(1, activation='sigmoid')(z)
model = Model(inputs=[inputs], outputs=[outputs])







### Reusing Pretrained Embeddings

The TensorFlow Hub project makes it easy to reuse pretrained model components in your own models. These model components are called *modules*. Simply browse the TF Hub repository (https://tfhub.dev), find the one you need, and copy the code example into your project: the module will be automatically downloaded, along with its pretrained weights, and included in your model. By default, a hub.KerasLayer is not trainable, but you can set trainable=True when creating it to change that so that you can fine-tune it for your task.

Not all TF Hub modules support TensorFlow 2, so make sure you choose a module that does. By default, TF Hub will cache the downloaded files into the local system's temporary directory. You may prefer to download them into a more permanent directory to avoid having to download them again after every system cleanup. To do that, set the TFHUB_CACHE_DIR environment variable to the directory of your choice.

In [38]:
# import tempfile, shutil
# import tensorflow as tf
# import tensorflow_datasets as tfds
# import tensorflow_hub as hub
# import tf_keras as keras
# from tf_keras import layers

# handle = "https://tfhub.dev/google/tf2-preview/nnlm-en-dim50/1"

# model = keras.Sequential([
#     hub.KerasLayer(handle, dtype=tf.string, input_shape=(), trainable=False),
#     layers.Dense(128, activation="relu"),
#     layers.Dense(1, activation="sigmoid"),
# ])

# model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# tmp_dir = tempfile.mkdtemp()
# try:
#     datasets, info = tfds.load(
#         "imdb_reviews/plain_text",
#         as_supervised=True,
#         with_info=True,
#         data_dir=tmp_dir,
#     )
#     train_set = datasets["train"].batch(32).prefetch(tf.data.AUTOTUNE)
#     history = model.fit(train_set, epochs=1)
# finally:
#     shutil.rmtree(tmp_dir, ignore_errors=True)

Next, let's look at another important NLP task: *neural machine translation* (NMT), first using a pure Encoder-Decoder model, then improving it with attention mechanisms, and finally looking at the extraordinary Transformer architecture.

## An Encoder-Decoder Network for Neural Machine Translation (NMT)

Let's take a look at a simple neural machine translation model. In short, the English sentences are fed to the encoder, and the decoder outputs the French translations. For the very first word, the decoder is given the start-of-sequence (SOS) token. The decoder is expected to end the sentence with an end-of-sequence (EOS) token. Note that the English sentences are reversed before they are fed to the encoder. For example, "I drink milk" is reversed to "milk drink I". This ensures that the beginning of the English sentence will be fed last to the encoder, which is useful because that's generally the first thing that the decoder needs to translate.

Each word is initially represented by its ID. Next, an embedding layer returns the word embedding. These word embeddings are what is actually fed to the encoder and the decoder. At each step, the decoder outputs a score for each word in the output vocabulary (i.e., French), and then the softmax layer turns these scores into probabilities. The word with the highest probability is output. This is very much like a regular classification task, so you can train the model using the "sparse_categorical_crossentropy" loss. Note that at inference time (after training), you will not have the target sentence to feed to the decoder. Instead, simply feed the decoder the word that it output at the previous step.

**There are a few more details to handle if you implement this model:**

> So far we have assumed that all input sequences have a constant length. But obviously sentence lengths vary. Since regular tensors have fixed shapes, they can only contain sentences of the same length. You can use masking to handle this, as discussed earlier. However, if the sentences have very different lengths, you can't just crop them like we did for sentiment analysis (because we want full translations). Instead, group sentences into buckets of similar lengths, using padding for the shorter sequences to ensure all sentences in a bucket have the same length (check out the tf.data.experimental.bucket_by_sequence_length() function).
<br><br>
> We want to ignore any output past the EOS token, so these tokens should not contribute to the loss. For example, if the model outputs "Je bois du lait <eos> oui", the loss for that last word should be ignored.
<br><br>
> When the output vocabulary is large (which is the case here), outputting a probability for each and every possible word would be terribly slow. To avoid this, one solution is to look only at the logits output by the model for the correct word and for a random sample of incorrect words, then compute an approximation of the loss based only on these logits. This *sampled softmax* technique was introduced in 2015. In TensorFlow you can use the tf.nn.sampled_softmax_loss() function for this during training and use the normal softmax function at inference time (sampled softmax cannot be used at inference because it requires the target).
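The second point above (ignoring everything past the EOS token in the loss) can be sketched in plain NumPy. This is only an illustration under stated assumptions: the function name `masked_sparse_crossentropy` is hypothetical, the targets are assumed to be padded after the EOS token, and the probabilities are made up.

```python
import numpy as np

def masked_sparse_crossentropy(y_true, y_proba, eos_id):
    """Mean cross-entropy over the tokens up to and including the first EOS."""
    y_true = np.asarray(y_true)
    is_eos = (y_true == eos_id).astype(int)
    # Count the EOS tokens seen strictly before each position
    eos_before = np.cumsum(is_eos, axis=1) - is_eos
    mask = (eos_before == 0).astype(float)  # 1.0 until the first EOS (inclusive)
    # Probability the model assigned to the correct token at each step
    p_correct = np.take_along_axis(y_proba, y_true[..., None], axis=-1)[..., 0]
    token_loss = -np.log(p_correct)
    return (token_loss * mask).sum() / mask.sum()

# One target sentence: tokens 3 and 4, then EOS (id 2), then padding (id 0)
y_true = np.array([[3, 4, 2, 0, 0]])
y_proba = np.full((1, 5, 5), 0.2)  # uniform predictions over a 5-word vocabulary
loss = masked_sparse_crossentropy(y_true, y_proba, eos_id=2)
print(loss)
```

Whatever the model predicts for the two positions after the EOS token has no effect on this loss, which is the behavior described above.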

The TensorFlow Addons project includes many sequence-to-sequence tools to let you easily build production-ready Encoder-Decoders. The code is mostly self-explanatory, but there are a few points to note. First, we set return_state=True when creating the LSTM layer so that we can get its final hidden state and pass it to the decoder. Since we are using an LSTM cell, it actually returns two hidden states (short term and long term). The TrainingSampler is one of several samplers available in TensorFlow Addons: their role is to tell the decoder at each step what it should pretend the previous output was. During inference, this should be the embedding of the token that was actually output. During training, it should be the embedding of the previous target token: this is why we used the TrainingSampler. In practice, it is often a good idea to start training with the embedding of the target of the previous time step and gradually transition to using the embedding of the actual token that was output at the previous step. The ScheduledEmbeddingTrainingSampler will randomly choose between the target or the actual output, with a probability that you can gradually change during training.

In [24]:
'''TFA explicitly supports TensorFlow ≥ 2.12 and < 2.15'''

# import tensorflow as tf
# import tensorflow_addons as tfa
# from tensorflow.keras import Model
# from tensorflow.keras.layers import Dense, Embedding, Input, LSTM, LSTMCell

# # encoder/decoder token ids: [batch, time]
# encoder_inputs = Input(shape=(None,), dtype=tf.int32, name="encoder_inputs")
# decoder_inputs = Input(shape=(None,), dtype=tf.int32, name="decoder_inputs")

# # decoder lengths: [batch]
# sequence_lengths = Input(shape=(), dtype=tf.int32, name="decoder_lengths")

# embeddings = Embedding(vocab_size, embed_size, name="token_embedding")
# encoder_embeddings = embeddings(encoder_inputs)   # [batch, time, embed]
# decoder_embeddings = embeddings(decoder_inputs)   # [batch, time, embed]

# # Encoder
# encoder = LSTM(512, return_state=True, name="encoder_lstm")
# _, state_h, state_c = encoder(encoder_embeddings)
# encoder_state = [state_h, state_c]  # LSTMCell state: [h, c]

# # Decoder
# sampler = tfa.seq2seq.sampler.TrainingSampler()
# decoder_cell = LSTMCell(512, name="decoder_cell")
# output_layer = Dense(vocab_size, name="vocab_projection")

# decoder = tfa.seq2seq.basic_decoder.BasicDecoder(
#     cell=decoder_cell,
#     sampler=sampler,
#     output_layer=output_layer,
# )

# final_outputs, final_state, final_sequence_lengths = decoder(
#     decoder_embeddings,
#     initial_state=encoder_state,
#     sequence_length=sequence_lengths,
# )

# logits = final_outputs.rnn_output                 # [batch, time, vocab]
# Y_proba = tf.nn.softmax(logits, axis=-1)

# model = Model(
#     inputs=[encoder_inputs, decoder_inputs, sequence_lengths],
#     outputs=Y_proba,
# )

'TFA explicitly supports TensorFlow ≥ 2.12 and < 2.15'

### Bidirectional RNNs

At each time step, a regular recurrent layer only looks at past and present inputs before generating its output. This type of RNN makes sense when forecasting time series, but for many NLP tasks, such as Neural Machine Translation, it is often preferable to look ahead at the next words before encoding a given word. To implement this, run two recurrent layers on the same inputs, one reading the words from left to right and the other reading them from right to left. Then simply combine their outputs at each time step, typically by concatenating them. This is called a *bidirectional recurrent layer*.

To implement a bidirectional recurrent layer in Keras, wrap a recurrent layer in a keras.layers.Bidirectional layer. The Bidirectional layer will create a clone of the GRU layer (but in the reverse direction), and it will run both and concatenate their outputs. So although the GRU layer has 10 units, the Bidirectional layer will output 20 values per time step.

In [25]:
from tensorflow.keras.layers import Bidirectional, GRU

Bidirectional(GRU(10, return_sequences=True))

<Bidirectional name=bidirectional, built=False>

### Beam Search

Suppose you train an Encoder-Decoder model, and use it to translate the French sentence "Comment vas-tu?" to English. You are hoping that it will output the proper translation ("How are you?") but unfortunately it outputs "How will you?". By greedily outputting the most likely word at every step, it ended up with a suboptimal translation. How can we give the model a chance to go back and fix mistakes it made earlier? One of the most common solutions is *beam search*: it keeps track of a short list of the k most promising sentences (say, the top three), and at each decoder step it tries to extend them by one word, keeping only the k most likely sentences. The parameter k is called the *beam width*. We can boost our Encoder-Decoder model's performance without any extra training simply by using it more wisely. You can implement beam search fairly easily using TensorFlow Addons.

We first create a BeamSearchDecoder, which wraps all the decoder clones (in this case 10 clones). Then we create one copy of the encoder's final state for each decoder clone, and we pass these states to the decoder, along with the start and end tokens. With all this, you can get good translations for fairly short sentences. Unfortunately, this model will be really bad at translating long sentences. Once again, the problem comes from the limited short-term memory of RNNs. *Attention mechanisms* are the game-changing innovation that addressed this problem.
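To make the beam search idea concrete without relying on any library, here is a small pure-Python sketch. It is not the TensorFlow Addons implementation: the `next_token_log_probs` callback is a hypothetical stand-in for the decoder, and the toy probability table is invented so that greedy decoding and beam search disagree.

```python
import math

def beam_search(next_token_log_probs, sos, eos, beam_width=3, max_len=10):
    """Return the best (sequence, log-probability) found with the given beam width."""
    beams = [([sos], 0.0)]  # (sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:  # finished sentences are carried over unchanged
                candidates.append((seq, score))
                continue
            for token, logp in next_token_log_probs(seq).items():
                candidates.append((seq + [token], score + logp))
        # Keep only the k most likely sentences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(seq[-1] == eos for seq, _ in beams):
            break
    return beams[0]

# Toy "decoder": invented next-word probabilities for each partial sentence
table = {
    ("<sos>",): {"How": 1.0},
    ("<sos>", "How"): {"will": 0.6, "are": 0.4},
    ("<sos>", "How", "will"): {"you": 0.55, "he": 0.45},
    ("<sos>", "How", "are"): {"you": 1.0},
    ("<sos>", "How", "will", "you"): {"<eos>": 1.0},
    ("<sos>", "How", "will", "he"): {"<eos>": 1.0},
    ("<sos>", "How", "are", "you"): {"<eos>": 1.0},
}

def toy_decoder(seq):
    return {tok: math.log(p) for tok, p in table[tuple(seq)].items()}

greedy_seq, _ = beam_search(toy_decoder, "<sos>", "<eos>", beam_width=1)
best_seq, best_score = beam_search(toy_decoder, "<sos>", "<eos>", beam_width=3)
print(greedy_seq)  # ends up with "How will you"
print(best_seq)    # ends up with "How are you"
```

With a beam width of 1 this reduces to greedy decoding, which commits to "will" because it is the most likely second word. With a width of 3, "How are you" survives long enough to win on total probability (0.4 versus 0.33). The sketch does no length normalization, which real implementations often add.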

## Attention Mechanisms

In a groundbreaking 2014 paper, Dzmitry Bahdanau et al. introduced a technique that allowed the decoder to focus on the appropriate words (as encoded by the encoder) at each time step. This means that the path from an input word to its translation is now much shorter, so the short-term memory limitations of RNNs have much less impact. Attention mechanisms revolutionized neural machine translation (and NLP in general), allowing a significant improvement in the state of the art, especially for long sentences (over 30 words).

At each time step, the decoder's memory cell computes a weighted sum of all the encoder outputs: this determines which words it will focus on at this step. The weight $\alpha_{(t,j)}$ is the weight of the $j^{th}$ encoder output at the $t^{th}$ decoder time step. For example, if the weight $\alpha_{(3, 2)}$ is much larger than the weights $\alpha_{(3, 0)}$ and $\alpha_{(3, 1)}$, then the decoder will pay much more attention to word number 2 ("milk") than to the other two words, at least at this time step. The rest of the decoder works just like earlier.

But where do these $\alpha_{(t,j)}$ weights come from? It's actually pretty simple: they are generated by a type of small neural network called an *alignment model* (or an *attention layer*), which is trained jointly with the rest of the Encoder-Decoder model. This model starts with a time-distributed Dense layer with a single neuron, which receives as input all the encoder outputs, concatenated with the decoder's previous hidden state. The layer outputs a score (or energy) for each encoder output: this score measures how well each output is aligned with the decoder's previous hidden state. Finally, all the scores go through a softmax layer to get a final weight for each encoder output. All the weights for a given decoder time step add up to 1. This particular attention mechanism is called *Bahdanau attention*. Since it concatenates the encoder output with the decoder's previous hidden state, it is sometimes called *concatenative attention* or *additive attention*.

***Recall that a time-distributed Dense layer is equivalent to a regular Dense layer that you apply independently at each time step (only much faster)***
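The alignment model can be sketched in NumPy for a single decoder time step. This sketch assumes the common formulation score = v · tanh(W [h; s]): the tanh hidden layer and the vector `v` go slightly beyond the single-neuron description above, and they are needed because with a purely linear neuron the decoder state would add the same constant to every score and cancel out in the softmax. All sizes and weights are random placeholders, not trained values.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(42)
n_words, enc_units, dec_units, attn_units = 3, 4, 4, 5

encoder_outputs = rng.normal(size=(n_words, enc_units))  # one row per input word
prev_decoder_state = rng.normal(size=dec_units)          # decoder's previous hidden state

# Alignment model parameters (random placeholders; learned jointly in practice)
W = rng.normal(size=(enc_units + dec_units, attn_units))
v = rng.normal(size=attn_units)

# Concatenate each encoder output with the decoder's previous hidden state
tiled_state = np.tile(prev_decoder_state, (n_words, 1))
concat = np.concatenate([encoder_outputs, tiled_state], axis=1)  # shape (3, 8)

scores = np.tanh(concat @ W) @ v    # one score (energy) per encoder output
alphas = softmax(scores)            # attention weights for this time step
context = alphas @ encoder_outputs  # weighted sum of the encoder outputs

print(alphas, alphas.sum())
```

The same weights `W` and `v` are applied to every encoder output, which is what "time-distributed" means here.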

If the input sentence is n words long, and assuming the output sentence is about as long, then this model will need to compute about $n^2$ weights. Fortunately, this quadratic computational complexity is still tractable because even long sentences don't have thousands of words. 

Another common attention mechanism was proposed shortly after in a 2015 paper. Because the goal of the attention mechanism is to measure the similarity between one of the encoder's outputs and the decoder's previous hidden state, the authors proposed to simply compute the dot product of these two vectors, as this is often a fairly good similarity measure, and modern hardware can compute it much faster. For this to be possible, both vectors must have the same dimensionality. This is called *Luong attention*, or sometimes *multiplicative attention*. The dot product gives a score, and all the scores (at a given decoder time step) go through a softmax layer to give the final weights, just like in Bahdanau attention. Another simplification they proposed was to use the decoder's hidden state at the current step rather than at the previous time step, then to use the output of the attention mechanism directly to compute the decoder's predictions (rather than using it to compute the decoder's current hidden state). They also proposed a variant of the dot product mechanism where the encoder outputs first go through a linear transformation (i.e., a time-distributed Dense layer without a bias term) before the dot products are computed. This is called the "general" dot product approach. They compared both dot product approaches to the concatenative attention mechanism (adding a rescaling parameter vector **v**), and they observed that the dot product variants performed better than concatenative attention. For this reason, concatenative attention is much less used now.
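Both dot product variants fit in a few lines of NumPy. This is a sketch for one decoder time step: the function name `luong_attention` and the toy vectors are hypothetical, and the optional matrix `W` plays the role of the bias-free linear transformation in the "general" variant.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def luong_attention(encoder_outputs, decoder_state, W=None):
    """Dot-product attention; pass W to get the "general" variant."""
    keys = encoder_outputs if W is None else encoder_outputs @ W
    scores = keys @ decoder_state        # one dot product per encoder output
    weights = softmax(scores)            # weights add up to 1
    context = weights @ encoder_outputs  # weighted sum of the encoder outputs
    return weights, context

encoder_outputs = np.array([[1.0, 0.0],
                            [0.0, 1.0],
                            [1.0, 1.0]])
decoder_state = np.array([0.0, 2.0])     # current decoder hidden state

weights, context = luong_attention(encoder_outputs, decoder_state)
print(weights)  # the first encoder output gets the smallest weight
```

Here the second and third encoder outputs have the same dot product with the decoder state, so they receive equal weights. Compared with concatenative attention, there are no extra parameters at all in the plain dot variant.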

### Visual Attention

**Attention mechanisms are now used for a variety of purposes. One of their applications beyond NMT was in generating image captions using visual attention: a convolutional neural network first processes the image and outputs some feature maps, then a decoder RNN equipped with an attention mechanism generates the caption, one word at a time. At each decoder time step (each word), the decoder uses the attention model to focus on just the right part of the image.** Attention mechanisms are so powerful that you can actually build state-of-the-art models using only attention mechanisms.

### Explainability

One extra benefit of attention mechanisms is that they make it easier to understand what led the model to produce its output. This is called *explainability*. It can be especially useful when the model makes a mistake. In some applications, explainability is not just a tool to debug a model: it can be a legal requirement (think of a system deciding whether or not it should grant you a loan).

### Attention Is All You Need: The Transformer Architecture

In a groundbreaking 2017 paper, a team of Google researchers managed to create an architecture called the *Transformer*, which significantly improved the state of the art in NMT without using any recurrent or convolutional layers, just attention mechanisms (plus embedding layers, dense layers, normalization layers, and a few other bits and pieces). As an extra bonus, this architecture was also much faster to train and easier to parallelize, so they managed to train it at a fraction of the time and cost of the previous state-of-the-art models. Let's look a bit closer at both of the novel components of the Transformer architecture, starting with positional embeddings.

#### Positional Embeddings

**A positional embedding is a dense vector that encodes the position of a word within the sentence**: the $i^{th}$ positional embedding is simply added to the word embedding of the $i^{th}$ word in the sentence. These positional embeddings can be learned by the model, but in the paper the authors preferred to use fixed positional embeddings, defined using the sine and cosine functions of different frequencies. This solution gives the same performance as learned positional embeddings do, but it can extend to arbitrarily long sentences, which is why it's favored. After the positional embeddings are added to the word embeddings, the rest of the model has access to the absolute position of each word in the sentence because there is a unique positional embedding for each position.

Moreover, the choice of oscillating functions makes it possible for the model to learn relative positions as well. For example, words located 38 words apart (e.g., at positions p = 22 and p = 60) always have the same positional embedding values in certain embedding dimensions. **This explains why we need both the sine and cosine for each frequency: if we only used the sine, the model would not be able to distinguish positions p = 25 and p = 35**.
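One way to see why sine/cosine pairs help with relative positions: for a given frequency, the pair at position p + k is a fixed rotation of the pair at position p, and that rotation depends only on the offset k. The NumPy check below is a sketch of this trigonometric identity; the frequency index (i = 3) and embedding size (512) are arbitrary choices for illustration.

```python
import numpy as np

def sin_cos_pair(p, freq):
    """Sine and cosine positional values for position p at one frequency."""
    return np.array([np.sin(freq * p), np.cos(freq * p)])

freq = 1 / 10_000 ** (2 * 3 / 512)  # frequency of pair i = 3 when max_dims = 512
k = 38                              # relative offset between two words

# Rotation matrix that depends only on the offset k, not on the position p
rotation = np.array([[np.cos(freq * k), np.sin(freq * k)],
                     [-np.sin(freq * k), np.cos(freq * k)]])

shifted = rotation @ sin_cos_pair(22, freq)
print(np.allclose(shifted, sin_cos_pair(60, freq)))
```

With the sine alone, no such position-independent linear map exists, which is another way of seeing why both functions are used for each frequency.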

There is no PositionalEmbedding layer in TensorFlow, but it is easy to create one. For efficiency reasons, we precompute the positional embedding matrix in the constructor (so we need to know the maximum sentence length, max_steps, and the number of dimensions for each word representation, max_dims). Then the call() method crops the embedding matrix to the size of the inputs, and it adds it to the inputs. Since we added an extra first dimension of size 1 when creating the positional embedding matrix, the rules of broadcasting will ensure that the matrix gets added to every sentence in the inputs. Next we look deeper into the heart of the Transformer model: the Multi-Head Attention layer.

In [27]:
import numpy as np
from tensorflow.keras.layers import Embedding, Input

# Create the class that handles positional embedding
class PositionalEncoding(tf.keras.layers.Layer):
    def __init__(self, max_steps, max_dims, dtype=tf.float32, **kwargs):
        super().__init__(dtype=dtype, **kwargs)
        if max_dims % 2 == 1: max_dims +=1 # max_dims must be even
        p, i = np.meshgrid(np.arange(max_steps), np.arange(max_dims // 2))
        pos_emb = np.empty((1, max_steps, max_dims))
        pos_emb[0, :, ::2] = np.sin(p / 10_000 ** (2 * i / max_dims)).T
        pos_emb[0, :, 1::2] = np.cos(p / 10_000 ** (2 * i / max_dims)).T
        self.positional_embedding = tf.constant(pos_emb.astype(self.dtype))

    def call(self, inputs):
        shape = tf.shape(inputs)
        # Add (not multiply) the positional embeddings, cropped to the input size
        return inputs + self.positional_embedding[:, :shape[-2], :shape[-1]]

# Create the first layers of the Transformer
embed_size = 512; max_steps = 500; vocab_size = 10_000
encoder_inputs = Input(shape=[None], dtype=np.int32)
decoder_inputs = Input(shape=[None], dtype=np.int32)
embeddings = Embedding(vocab_size, embed_size)
encoder_embeddings = embeddings(encoder_inputs)
decoder_embeddings = embeddings(decoder_inputs)
positional_encoding = PositionalEncoding(max_steps, max_dims=embed_size)
encoder_in = positional_encoding(encoder_embeddings)
decoder_in = positional_encoding(decoder_embeddings)

#### Multi-Head Attention

To understand how a Multi-Head Attention layer works, we must first understand the *Scaled Dot-Product Attention* layer, which it is based on. Let's suppose the encoder analyzed the input sentence "They played chess" and it managed to understand that the word "They" is the subject and the word "played" is the verb. Now suppose the decoder needs to translate the verb next: fetching it from the input sentence is analogous to a dictionary lookup, as if the encoder had created a dictionary such as {"subject": "They", "verb": "played"} and the decoder wanted the value for the key "verb". However, the model does not have discrete tokens to represent the keys (like "subject" or "verb"); it has vectorized representations of these concepts (which it learned during training), so the key it will use for the lookup (called the *query*) will not perfectly match any key in the dictionary. The solution is to compute a similarity measure between the query and each key in the dictionary, then use the softmax function to convert these similarity scores to weights that add up to 1. The result of the lookup is the sum of the values, weighted accordingly.

**In short, you can think of this whole process as a differentiable dictionary lookup. The similarity measure used by the Transformer is just the dot product, like in Luong attention.**

The keras.layers.Attention layer implements Scaled Dot-Product Attention. If we ignore the skip connections, the layer normalization layers, the Feed Forward blocks, and the fact that this is Scaled Dot-Product attention, not exactly Multi-Head Attention, then the rest of the Transformer model can be implemented as follows.

In [31]:
from tensorflow.keras.layers import Attention, Dense, TimeDistributed

Z = encoder_in
for N in range(6):
    Z = Attention(use_scale=True)([Z, Z])

encoder_outputs = Z
Z = decoder_in
for N in range(6):
    Z = Attention(use_scale=True)([Z, Z])
    Z = Attention(use_scale=True)([Z, encoder_outputs])

outputs = TimeDistributed(Dense(vocab_size, activation='softmax'))(Z)

The use_scale=True argument creates an additional parameter that lets the layer learn how to properly downscale the similarity scores. This is a bit different from the Transformer model, which always downscales the similarity scores by the same factor $\sqrt{d_{keys}}$.
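For comparison, here is a NumPy sketch of Scaled Dot-Product Attention with the fixed $\sqrt{d_{keys}}$ scaling, framed as the differentiable dictionary lookup described earlier. The key, value, and query vectors are made up for illustration; in a real model they are learned representations.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_queries, d_keys), K: (n_keys, d_keys), V: (n_keys, d_values)."""
    d_keys = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_keys)  # similarity of each query to each key
    weights = softmax(scores, axis=-1)  # each row adds up to 1
    return weights @ V, weights         # weighted sum of the values

# Toy dictionary: key 0 stands for "subject", key 1 stands for "verb"
K = np.array([[10.0, 0.0],
              [0.0, 10.0]])
V = np.array([[1.0, 0.0, 0.0],   # value stored under "subject"
              [0.0, 1.0, 0.0]])  # value stored under "verb"
Q = np.array([[0.0, 10.0]])      # a query that resembles the "verb" key

output, weights = scaled_dot_product_attention(Q, K, V)
print(weights.round(3))  # nearly all the weight goes to the "verb" key
print(output.round(3))   # so the output is nearly the "verb" value
```

Because the query matches the second key much better than the first, the lookup returns almost exactly the second value, yet every step remains differentiable.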

Now it's time to look at the final piece of the puzzle: **What is a Multi-Head Attention layer? It is just a bunch of Scaled Dot-Product Attention layers, each preceded by a linear transformation of the values, keys, and queries (i.e., a time-distributed Dense layer with no activation function). All the outputs are simply concatenated, and they go through a final linear transformation (again, time-distributed). But why?** What is the intuition behind this architecture? Well, consider the word "played" we discussed earlier. The encoder was smart enough to encode the fact that it is a verb. But the word representation also includes its position in the text, thanks to the positional encodings, and it probably includes many other features that are useful for its translation, such as the fact that it is in the past tense. **In short, the word representation encodes many different characteristics of the word. If we just used a single Scaled Dot-Product Attention layer, we would only be able to query all of these characteristics in one shot. This is why the Multi-Head Attention layer applies multiple different linear transformations of the values, keys, and queries: this allows the model to apply many different projections of the word representation into different subspaces, each focusing on a subset of the word's characteristics.** Perhaps one of the linear layers will project the word representation into a subspace where all that remains is the information that the word is a verb, another linear layer will extract just the fact that it is past tense, and so on. Then the Scaled Dot-Product Attention layers implement the lookup phase, and finally we concatenate all the results and project them back to the original space.
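The structure just described (per-head linear projections, Scaled Dot-Product Attention, concatenation, final projection) can be sketched in NumPy as follows. This is an illustration only: the projection matrices are random placeholders created inside the function, whereas a real layer would store and learn them, and the function name and sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(Q, K, V, n_heads, rng):
    """Q: (n_queries, d_model); K and V: (n_keys, d_model)."""
    d_model = Q.shape[-1]
    d_head = d_model // n_heads
    head_outputs = []
    for _ in range(n_heads):
        # Linear projections into a smaller subspace (no activation function)
        Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        q, k, v = Q @ Wq, K @ Wk, V @ Wv
        # Scaled Dot-Product Attention within this head's subspace
        weights = softmax(q @ k.T / np.sqrt(d_head))
        head_outputs.append(weights @ v)
    concat = np.concatenate(head_outputs, axis=-1)  # (n_queries, n_heads * d_head)
    Wo = rng.normal(size=(n_heads * d_head, d_model))
    return concat @ Wo                              # project back to d_model

rng = np.random.default_rng(0)
decoder_words = rng.normal(size=(3, 8))  # 3 query positions, d_model = 8
encoder_words = rng.normal(size=(5, 8))  # 5 key/value positions

Z = multi_head_attention(decoder_words, encoder_words, encoder_words, n_heads=2, rng=rng)
print(Z.shape)
```

Each head works in a subspace of size `d_model // n_heads`, so the total cost stays close to that of a single full-size attention layer.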

## Recent Innovations in Language Models

The year 2018 has been called the "ImageNet moment for NLP": progress was astounding, with larger and larger LSTM and Transformer-based architectures trained on immense datasets. The following papers were among the most influential of this period:

1. The ELMo paper by Matthew Peters introduced *Embeddings from Language Models* (ELMo): these are contextualized word embeddings. For example, the word "queen" will not have the same embedding in "Queen of the United Kingdom" and in "queen bee". <br><br>
2. The ULMFiT paper by Jeremy Howard and Sebastian Ruder demonstrated the effectiveness of unsupervised pretraining for NLP tasks: the authors trained an LSTM language model using self-supervised learning (i.e., generating the labels automatically from the data), then they fine-tuned it on various tasks. Their model outperformed the state of the art, reducing the error by 18-24% in most cases. Moreover, they showed that by fine-tuning the pretrained model on just 100 labeled examples, they could achieve the same performance as a model trained from scratch on 10,000 examples.<br><br>
3. The GPT paper by Alec Radford and other OpenAI researchers also demonstrated the effectiveness of unsupervised pretraining, this time with an architecture using only Masked Multi-Head Attention layers, pretrained on a large dataset, once again using self-supervised learning. Then they fine-tuned it on various language tasks, using only minor adaptations for each task. Just a few months later, in February 2019, Alec Radford, Jeffrey Wu, and other OpenAI researchers published the GPT-2 paper, which proposed a very similar architecture but larger still (with over 1.5 billion parameters!) and they showed that it could achieve **good performance on many tasks without any fine-tuning. This is called *zero-shot learning* (ZSL).**<br><br>
4. The BERT paper by Jacob Devlin and other Google researchers also demonstrated the effectiveness of self-supervised pretraining on a large corpus, using a similar architecture to GPT but non-masked Multi-Head Attention layers (like the Transformer encoder). Most importantly, the authors proposed two pretraining tasks that explain most of the model's strength:
   1. *Masked Language Model* (MLM):
      >If the original sentence is "She had fun at the birthday party", then the model may be given the sentence "She <mask> fun at the <mask> party" and it must predict the words "had" and "birthday". Each selected word has an 80% chance of being masked, a 10% chance of being replaced by a random word (to reduce the discrepancy between pretraining and fine-tuning) and a 10% chance of being left alone (to bias the model toward the correct answer).
   2. *Next Sentence Prediction* (NSP)
      >The model is trained to predict whether two sentences are consecutive or not. This is a challenging task, and it significantly improves the performance of the model when it is fine-tuned on tasks such as question answering or entailment.
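The 80% / 10% / 10% rule of the Masked Language Model task can be sketched in pure Python. This is an illustration, not BERT's actual preprocessing code: the function name `mask_for_mlm` is hypothetical, the selection probability of 15% is the value reported in the BERT paper rather than something stated above, and words are used instead of subword tokens.

```python
import random

def mask_for_mlm(tokens, vocab, select_prob=0.15, mask_token="<mask>", seed=None):
    """Return (inputs, targets); targets are None where no prediction is needed."""
    rng = random.Random(seed)
    inputs, targets = [], []
    for token in tokens:
        if rng.random() < select_prob:
            targets.append(token)  # the model must predict the original word
            r = rng.random()
            if r < 0.8:
                inputs.append(mask_token)         # 80%: replace with the mask token
            elif r < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: replace with a random word
            else:
                inputs.append(token)              # 10%: leave the word unchanged
        else:
            targets.append(None)   # not selected: no loss at this position
            inputs.append(token)
    return inputs, targets

tokens = "She had fun at the birthday party".split()
# A high select_prob is used here only so that this short sentence gets some masks
inputs, targets = mask_for_mlm(tokens, vocab=tokens, select_prob=0.5, seed=0)
print(inputs)
print(targets)
```

Only the positions with a non-None target would contribute to the pretraining loss.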

**As you can see, the main innovations in 2018 and 2019 have been better subword tokenization, shifting from LSTMs to Transformers, and pretraining universal language models using self-supervised learning, then fine-tuning them with very few architectural changes (or none at all).** Things are moving fast; no one can say what architectures will prevail next year. Today, it's clearly Transformers, but tomorrow it might be CNNs. Or it might even be RNNs, if they make a surprise comeback. In the next chapter we will discuss how to learn deep representations in an unsupervised way using autoencoders, and we will use generative adversarial networks (GANs) to produce images and more!

# Exercises

<b>1. What are the pros and cons of using a stateful RNN versus a stateless RNN?

My Answer: <br> Stateless RNNs are simpler but cannot learn long-term patterns. Stateful RNNs are more complex when it comes to batching and sequence lengths, but they are capable of learning longer-term patterns.

Book Answer:

<b>2. Why do people use Encoder-Decoder RNNs rather than plain sequence-to-sequence RNNs for automatic translation?

My Answer: <br>Encoder-Decoder is superior because: it handles different lengths naturally, it separates the processing of the source task and the writing to target language task, it avoids fixed-vector bottlenecks, it enables alignment and reordering and training is straightforward.

Book Answer:

<b>3. How can you deal with variable-length input sequences? What about variable length output sequences?

My Answer:

Book Answer:

<b>4. What is beam search and why would you use it? What tool can you use to implement it?

My Answer:

Book Answer:

<b>5. What is an attention mechanism? How does it help?

My Answer:

Book Answer:

<b>6. What is the most important layer in the Transformer architecture? What is its purpose?

My Answer:

Book Answer:

<b>7. When would you need to use sampled softmax?

My Answer:

Book Answer:

<b>8. Choose a particular embedded Reber grammar, then train an RNN to identify whether a string respects that grammar or not. You will first need to write a function capable of generating a training batch containing about 50% strings that respect the grammar and 50% that don't.

My Answer:

<b>9. Train an Encoder-Decoder model that can convert a date string from one format to another.

My Answer:

<b>10. Go through TensorFlow's Neural Machine Translation with Attention tutorial (https://homl.info/nmttuto)

My Answer:

<b>11. Use one of the recent language models (e.g., BERT) to generate more convincing Shakespearean text.

My Answer: