# Natural Language Processing with RNNs and Attention

In [1]:
import tensorflow as tf
from tensorflow import keras
import numpy as np



## Generating Shakespearean Text Using a Character RNN

In a 2015 blog post, Andrej Karpathy showed how an RNN can be trained to predict the next character in a sequence. We will look at how to build a Char-RNN, step by step, starting with the creation of the dataset.

### Creating the Training Dataset

First, let's download all of Shakespeare's work, using Keras's handy `get_file()` function and downloading the data from Andrej Karpathy's Char-RNN project:

In [3]:
shakespeare_url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
filepath = keras.utils.get_file("shakespere.txt", shakespeare_url)
with open(filepath) as f:
    shakespeare_text = f.read()

Downloading data from https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt


In [None]:
print(shakespeare_text[:148])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?



In [None]:
"".join(sorted(set(shakespeare_text.lower())))

"\n !$&',-.3:;?abcdefghijklmnopqrstuvwxyz"

In [None]:
len("".join(sorted(set(shakespeare_text.lower()))))

39

Next, we must encode every character as an integer. One option is to create a custom preprocessing layer, as we did in Chapter 13. But in this case, it will be simpler to use Keras's `Tokenizer` class.

First we need to fit a tokenizer to the text: it will find all the characters used in the text and map each of them to a different character ID, from 1 to the number of distinct characters (it does not start at 0, so we can use that value for masking, as we will see later):

In [4]:
tokenizer = keras.preprocessing.text.Tokenizer(char_level=True) # char_level=True means every character will be treated as a token
tokenizer.fit_on_texts([shakespeare_text])
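To make the mapping concrete, here is a minimal pure-Python sketch of what a character-level tokenizer does. It is not the Keras implementation: the helper names (`fit_char_tokenizer`, `encode`, `decode`) and the sample string are assumptions made for illustration, and the sketch only mimics the idea of assigning IDs that start at 1, in order of decreasing character frequency.

```python
from collections import Counter


def fit_char_tokenizer(text):
    """Map each lowercase character to an ID starting at 1 (0 stays free)."""
    counts = Counter(text.lower())
    # most_common() sorts by decreasing count, so frequent characters get low IDs
    return {ch: i for i, (ch, _) in enumerate(counts.most_common(), start=1)}


def encode(text, char_to_id):
    """Convert a string to a list of character IDs."""
    return [char_to_id[ch] for ch in text.lower()]


def decode(ids, char_to_id):
    """Convert a list of character IDs back to a string."""
    id_to_char = {i: ch for ch, i in char_to_id.items()}
    return "".join(id_to_char[i] for i in ids)


sample = "First Citizen: hear me speak."  # hypothetical sample text
char_to_id = fit_char_tokenizer(sample)
ids = encode("First", char_to_id)
print(ids, "->", decode(ids, char_to_id))
```

Because ID 0 is never assigned to a character, it remains available as a padding or masking value.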

We set `char_level=True` to get character-level encoding rather than the default word-level encoding. Note that this tokenizer converts the text to lowercase by default (but we can set `lower=False` if we do not want that). Now the tokenizer can encode a sentence (or a list of sentences) to a list of character IDs and back, and it tells us how many distinct characters there are and the total number of characters in the text:

In [None]:
tokenizer.texts_to_sequences(["First"])

[[20, 6, 9, 8, 3]]

In [None]:
tokenizer.sequences_to_texts([[20,6,9,8,3]])

['f i r s t']

In [5]:
max_id = len(tokenizer.word_index) # number of distinct characters
max_id

39

In [6]:
dataset_size = sum(tokenizer.word_counts.values()) # total number of characters
dataset_size

1115394

In [None]:
len(shakespeare_text)

1115394

Let's encode the full text so each character is represented by its ID (we subtract 1 to get IDs from 0 to 38, rather than 1 to 39):

In [7]:
[encoded] = np.array(tokenizer.texts_to_sequences([shakespeare_text])) - 1

Before we continue, we need to split the dataset into a training set, a validation set and a test set. We can't just shuffle all the characters in the text.

### How to split a Sequential Dataset

> Refer notes

Let's take the first 90% of the text for the training set (keeping the rest for the validation set and the test set), and create a `tf.data.Dataset` that will return each character one by one from this set:

In [8]:
train_size = dataset_size * 90 // 100
dataset = tf.data.Dataset.from_tensor_slices(encoded[:train_size])
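The cell above only builds the training slice. A full chronological split could look like the following sketch. The 5% shares for validation and test, and the `split_sequence` helper itself, are assumptions: the text only fixes the 90% training share. The point is that each set is a contiguous, non-overlapping slice taken in order, with no shuffling.

```python
def split_sequence(sequence, train_pct=90, valid_pct=5):
    """Split a sequence into contiguous train/validation/test slices, in order."""
    n = len(sequence)
    train_end = n * train_pct // 100
    valid_end = n * (train_pct + valid_pct) // 100
    return sequence[:train_end], sequence[train_end:valid_end], sequence[valid_end:]


toy_sequence = list(range(100))  # stands in for the encoded text
train, valid, test = split_sequence(toy_sequence)
print(len(train), len(valid), len(test))
```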

### Chopping the Sequential Dataset into Multiple Windows

The training set now consists of a single sequence of over a million characters, so we can't just train the neural network directly on it: the RNN would be equivalent to a deep net with over a million layers, and we would have a single (very long) instance to train it on. Instead, we will use the dataset's `window()` method to convert this long sequence of characters into many smaller windows of text. Every instance in the dataset will be a fairly short substring of the whole text, and the RNN will be unrolled only over the length of these substrings. This is called *truncated backpropagation through time*.

Let's call the `window()` method to create a dataset of short text windows:

In [9]:
n_steps = 100
window_length = n_steps + 1 # target = input shifted 1 character ahead
dataset = dataset.window(window_length, shift=1, drop_remainder=True)

**TIP:**

We can try tuning `n_steps`: it is easier to train RNNs on shorter input sequences, but of course the RNN will not be able to learn any pattern longer than `n_steps`, so don't make it too small.

By default, the `window()` method creates nonoverlapping windows, but to get the largest possible training set we use `shift=1` so that the first window contains characters 0 to 100, the second contains characters 1 to 101, and so on. To ensure that all windows are exactly 101 characters long (which will allow us to create batches without having to do any padding), we set `drop_remainder=True` (otherwise the last 100 windows would contain 100 characters, 99 characters, and so on down to 1 character).
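The effect of `shift` and `drop_remainder` is easier to see on a tiny sequence. The following is a pure-Python sketch, not the `tf.data` implementation: `sliding_windows` is a hypothetical helper that only mimics the windowing behavior described above, returning plain lists instead of nested datasets.

```python
def sliding_windows(sequence, window_length, shift=None, drop_remainder=False):
    """Return windows of `window_length` items, each starting `shift` items later."""
    shift = window_length if shift is None else shift  # default: no overlap
    windows = []
    for start in range(0, len(sequence), shift):
        window = sequence[start:start + window_length]
        if drop_remainder and len(window) < window_length:
            break  # every later window would be even shorter
        windows.append(window)
    return windows


chars = list("abcdefg")
full = sliding_windows(chars, window_length=4, shift=1, drop_remainder=True)
ragged = sliding_windows(chars, window_length=4, shift=1)
print(["".join(w) for w in full])
print(["".join(w) for w in ragged])
```

With `drop_remainder=True` every window has the same length, which is what makes it possible to stack the windows into a batch without padding.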

The `window()` method creates a dataset that contains windows, each of which is also represented as a dataset. It is a *nested dataset*, analogous to a list of lists. This is useful when we want to transform each window by calling its dataset methods (e.g., to shuffle them or batch them). However, we cannot use a nested dataset directly for training, as our model will expect tensors as input, not datasets. So we must call the `flat_map()` method: it converts a nested dataset into a *flat dataset* (one that does not contain datasets). Moreover, the `flat_map()` method takes a function as an argument, which allows us to transform each dataset in the nested dataset before flattening.

In [None]:
dataset = dataset.flat_map(lambda window: window.batch(window_length))

Notice that we call `batch(window_length)` on each window: since all windows have exactly that length, we get a single tensor for each of them. Now the dataset contains consecutive windows of 101 characters each. Since Gradient Descent works best when the instances in the training set are independent and identically distributed, we need to shuffle these windows. Then we can batch the windows and separate the inputs (the first 100 characters) from the targets (the last 100 characters, i.e., the inputs shifted one character ahead):

In [None]:
batch_size = 32
dataset = dataset.shuffle(10000).batch(batch_size)
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))
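The slicing in the last line can be checked on a toy window. This sketch uses a plain Python string rather than a batch of tensors, and `split_inputs_targets` is a hypothetical helper: it shows that the target sequence is simply the input sequence shifted one character ahead, so the model gets a target at every time step.

```python
def split_inputs_targets(window):
    """Inputs are all items but the last; targets are the inputs shifted by one."""
    return window[:-1], window[1:]


window = "hello"  # a toy window of length n_steps + 1, with n_steps = 4
inputs, targets = split_inputs_targets(window)
for x, y in zip(inputs, targets):
    print(f"after {x!r} the model should predict {y!r}")
```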

As discussed in Chapter 13, categorical input features should generally be encoded, usually as one-hot vectors or as embeddings. Here we will encode each character using a one-hot vector because there are fairly few distinct characters (only 39):

In [None]:
dataset = dataset.map(
    lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch)
)

dataset = dataset.prefetch(1)
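As a reminder of what one-hot encoding produces, here is a small pure-Python sketch. The `one_hot` helper below is an illustrative stand-in for `tf.one_hot()`, working on plain lists with a toy `depth` of 4 instead of 39.

```python
def one_hot(ids, depth):
    """Encode each ID as a vector of `depth` zeros with a single 1.0 at index ID."""
    return [[1.0 if i == id_ else 0.0 for i in range(depth)] for id_ in ids]


vectors = one_hot([2, 0, 3], depth=4)
for id_, vector in zip([2, 0, 3], vectors):
    print(id_, "->", vector)
```

Each character ID becomes a vector of length `depth`, which is why the input batches end up with a third dimension of size `max_id`.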

In [None]:
for X_batch, Y_batch in dataset.take(1):
    print(X_batch.shape, Y_batch.shape)

(32, 100, 39) (32, 100)




That's it! Preparing the dataset was the hardest part. Now let's create the model.

### Building and Training the Char-RNN Model

In [None]:
early_stopping_cb = keras.callbacks.EarlyStopping(monitor="loss", patience=5)
model_checkpoint_cb = keras.callbacks.ModelCheckpoint("char-rnn_model.keras",
                                                      monitor="loss",save_best_only=True)

model = keras.models.Sequential([
    # 20% dropout on both the inputs (dropout) and the hidden state
    # (recurrent_dropout). Remove recurrent_dropout from both GRU layers
    # when training the model on a GPU.
    keras.layers.GRU(128, return_sequences=True, input_shape=[None, max_id],
                     dropout=0.2, recurrent_dropout=0.2),
    keras.layers.GRU(128, return_sequences=True, dropout=0.2, recurrent_dropout=0.2),
    # The output layer has max_id (39) units, one per distinct character, with a
    # softmax activation so that it outputs a probability for each possible
    # next character.
    keras.layers.TimeDistributed(keras.layers.Dense(max_id, activation="softmax"))
])

model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")

In [None]:
h = model.fit(dataset, epochs=15, callbacks=[early_stopping_cb, model_checkpoint_cb])

> The model was trained and run on a GPU, and its weights have been saved in the "char-rnn_model.keras" file. The Colab notebook is available [here](https://colab.research.google.com/drive/1GVDdAc-b-ysUxoQ4zUFIYrwLJguLWS8Q?usp=sharing).

In [None]:
model = tf.keras.models.load_model("char-rnn_model.keras")

### Using the Char-RNN Model

Now we have a model that can predict the next character in text written by Shakespeare. To feed it some text, we first need to preprocess it like we did earlier, so let's create a function for this:

In [18]:
def preprocess(texts):
    X = np.array(tokenizer.texts_to_sequences(texts)) - 1
    # print(tf.one_hot(X, max_id))
    return tf.one_hot(X, max_id)

Now let's use the model to predict the next letter in some text:

In [None]:
X_new = preprocess(["How are yo"])

In [None]:
Y_pred = np.argmax(model(X_new), axis=-1)
Y_pred

In [None]:
tokenizer.sequences_to_texts(Y_pred+1)

In [None]:
tokenizer.sequences_to_texts(Y_pred+1)[0][-1] # last character

Success! The model guessed right. Now let's use this model to generate new text.

### Generating Fake Shakespearean Text

To generate new text using the Char-RNN model, we could feed it some text, make the model predict the most likely next letter, add it at the end of the text, then give the extended text to the model to guess the next letter, and so on. But in practice, this often leads to the same words being repeated over and over again.

Instead, we can pick the next character randomly, with a probability equal to the estimated probability, using TensorFlow's `tf.random.categorical()` function. This will generate more diverse text.

The `categorical()` function samples random class indices, given the class log probabilities (logits). To have more control over the diversity of the generated text, we can divide the logits by a number called the *temperature*, which we can tweak as we wish: a temperature close to 0 will favour high-probability characters, while a very high temperature will give all characters an equal probability.
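The effect of the temperature can be checked numerically. The sketch below is pure Python and independent of TensorFlow; `apply_temperature` is a hypothetical helper that divides the log probabilities by the temperature and renormalizes them with a softmax, which is the distribution that sampling from the rescaled logits follows.

```python
import math


def apply_temperature(probas, temperature):
    """Rescale log probabilities by 1/temperature, then renormalize (softmax)."""
    logits = [math.log(p) / temperature for p in probas]
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - max_logit) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


probas = [0.5, 0.4, 0.1]
cold = apply_temperature(probas, 0.2)     # sharpens the distribution
neutral = apply_temperature(probas, 1.0)  # leaves it unchanged
hot = apply_temperature(probas, 100.0)    # flattens it toward uniform
for name, dist in [("cold", cold), ("neutral", neutral), ("hot", hot)]:
    print(name, [round(p, 3) for p in dist])
```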

In [None]:
tf.random.set_seed(42)

tf.random.categorical([[np.log(0.5), np.log(0.4), np.log(0.1)]], num_samples=5).numpy()

Let's break down the code above!

Imagine you have a bag of colored marbles, and each color represents a different category. Now, let's say you want to randomly pick marbles from this bag, but not all colors are equally likely to be picked. Some colors might be more likely than others.

The `tf.random.categorical` function is like a magical machine that helps you do this. It's like asking the machine to pick marbles from the bag according to certain rules.

Here's how it works:

1. **Input**: You need to tell the machine about the bag of marbles and how likely each color is to be picked. In technical terms, you provide the machine with a list of numbers called "logits", which here are the logarithms of the probabilities of each category. For example, if you have three colors (red, green, blue), you might tell the machine that there's a 50% chance of picking red, a 40% chance of picking green, and a 10% chance of picking blue. You just pass the log of each probability (`np.log(0.5)`, `np.log(0.4)`, `np.log(0.1)`) rather than the probabilities themselves.

2. **Generate Samples**: Once the machine knows about the probabilities, you ask it to pick marbles from the bag. You tell it how many marbles you want it to pick. In technical terms, you specify the number of samples you want.

3. **Output**: After you ask the machine to pick marbles, it gives you a list of numbers back. Each number represents the color of a marble it picked. These numbers are the indices of the categories you defined earlier. For example, if you have three colors and the machine gives you the numbers 0, 1, 0, 2, 1, it means it picked the first color, then the second color, then the first color again, then the third color, and finally the second color again.

So, in simple terms, `tf.random.categorical` is like a machine that randomly picks items (categories) from a list (distribution) based on how likely you tell it each item is to be picked. It then gives you back the items it picked in the form of a list of numbers.
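The marble machine can be imitated with the standard library. This is only a sketch of the idea, not of how TensorFlow implements it: `sample_categorical` is a hypothetical helper that turns the logits back into weights with `exp()` and draws indices with `random.choices()`. With enough samples, the observed frequencies should come close to 50%, 40% and 10%.

```python
import math
import random


def sample_categorical(logits, num_samples, seed=None):
    """Draw `num_samples` category indices with probabilities proportional to exp(logit)."""
    rng = random.Random(seed)
    weights = [math.exp(logit) for logit in logits]
    return rng.choices(range(len(logits)), weights=weights, k=num_samples)


logits = [math.log(0.5), math.log(0.4), math.log(0.1)]
samples = sample_categorical(logits, num_samples=10_000, seed=42)
frequencies = [samples.count(i) / len(samples) for i in range(len(logits))]
print(samples[:5], [round(f, 3) for f in frequencies])
```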

The following `next_char()` function uses the approach described above to pick the next character to add to the input text:

In [19]:
def next_char(text, temperature=1):
    X_new = preprocess([text])
    y_proba = model(X_new)[0, -1:, :] # all columns of last row
    rescaled_logits = tf.math.log(y_proba) / temperature
    char_id = tf.random.categorical(rescaled_logits, num_samples=1) + 1
    return tokenizer.sequences_to_texts(char_id.numpy())[0]

In [None]:
tf.random.set_seed(42)

next_char("how are yo", temperature=1)

Now, let's write a small function that repeatedly calls `next_char()` to get the next character and appends it to the given text:

In [21]:
def complete_text(text, n_chars=50, temperature=1):
    for _ in range(n_chars):
        text += next_char(text, temperature)
    return text

Now we are ready to generate some text! Let's try with different temperatures:

In [None]:
print(complete_text("t", temperature=0.2))

In [None]:
print(complete_text("w", temperature=1))

In [None]:
print(complete_text("w", temperature=2))

Apparently our Shakespeare model works best at a temperature close to 1. To generate more convincing text, we could try using more `GRU` layers and more neurons per layer, training for longer, and adding some regularization (for example, we could set `recurrent_dropout=0.3` in the `GRU` layers). Moreover, the model is currently incapable of learning patterns longer than `n_steps`, which is just 100 characters. We could try making this window larger, but that would also make training harder, and even LSTM and GRU cells cannot handle very long sequences.

Alternatively, we could use a stateful RNN.

### Stateful RNN

Until now, we have used only *stateless* RNNs: at each training iteration the model starts with a hidden state full of zeros, then it updates this state at each time step, and after the last time step, it throws it away, as it is not needed anymore. What if we told the RNN to preserve this final state after processing one training batch and use it as the initial state for the next training batch? This way the model can learn long-term patterns despite only backpropagating through short sequences. This is called a *stateful* RNN. Let's build one.

First, note that a stateful RNN only makes sense if each input sequence in a batch starts exactly where the corresponding sequence in the previous batch left off. So the first thing we need to do to build a stateful RNN is to use sequential and nonoverlapping input sequences (rather than the shuffled and overlapping sequences we used to train stateless RNNs).

When creating the `Dataset`, we must therefore use `shift=n_steps` rather than `shift=1` when calling the `window()` method. Moreover, we must obviously not call the `shuffle()` method. Batching is much harder when preparing data for a stateful RNN than it is for a stateless RNN. Indeed, if we were to call `batch(32)`, then 32 consecutive windows would be put in the same batch, and the following batch would not continue each of these windows where it left off. The first batch would contain windows 1 to 32 and the second batch would contain windows 33 to 64, so if we consider, say, the first window of each batch (i.e., windows 1 and 33), we can see that they are not consecutive. The simplest solution to this problem is to just use "batches" containing a single window:

In [10]:
tf.random.set_seed(42)

dataset = tf.data.Dataset.from_tensor_slices(encoded[:train_size])
dataset = dataset.window(window_length, shift=n_steps, drop_remainder=True)
dataset = dataset.flat_map(lambda window: window.batch(window_length))
dataset = dataset.batch(1)
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))
dataset = dataset.map(
    lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch))
dataset = dataset.prefetch(1)

Batching is harder but not impossible. For example, we could chop Shakespeare's text into 32 texts of equal length, create one dataset of consecutive input sequences for each of them, and finally use `tf.data.Dataset.zip(datasets).map(lambda *windows: tf.stack(windows))` to create proper consecutive batches, where the $n^{th}$ input sequence in a batch starts off exactly where the $n^{th}$ input sequence ended in the previous batch:

In [10]:
batch_size = 32
encoded_parts = np.array_split(encoded[:train_size], batch_size)
datasets = []
for encoded_part in encoded_parts:
    dataset = tf.data.Dataset.from_tensor_slices(encoded_part)
    dataset = dataset.window(window_length, shift=n_steps, drop_remainder=True)
    dataset = dataset.flat_map(lambda window: window.batch(window_length))
    datasets.append(dataset)
dataset = tf.data.Dataset.zip(tuple(datasets)).map(
    lambda *windows: tf.stack(windows))
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))
dataset = dataset.map(
    lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch))
dataset = dataset.prefetch(1)
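The continuity property is worth verifying on a toy sequence. The sketch below re-creates the chop-and-zip idea with plain Python lists; `nonoverlapping_windows` and `stateful_batches` are hypothetical helpers, and, unlike `np.array_split()`, the sketch assumes equal-length parts and simply drops any leftover items. It also shows why naively batching consecutive windows does not give the same guarantee.

```python
def nonoverlapping_windows(sequence, n_steps):
    """Windows of n_steps + 1 items, each starting n_steps items after the previous."""
    window_length = n_steps + 1
    return [sequence[i:i + window_length]
            for i in range(0, len(sequence) - window_length + 1, n_steps)]


def stateful_batches(sequence, batch_size, n_steps):
    """Chop the sequence into batch_size parts, then zip their windows into batches."""
    part_length = len(sequence) // batch_size
    parts = [sequence[i * part_length:(i + 1) * part_length]
             for i in range(batch_size)]
    windows_per_part = [nonoverlapping_windows(part, n_steps) for part in parts]
    return [list(rows) for rows in zip(*windows_per_part)]


sequence = list(range(40))  # stands in for the encoded text
batches = stateful_batches(sequence, batch_size=2, n_steps=4)

# Naive alternative: put consecutive windows into the same batch.
windows = nonoverlapping_windows(sequence, n_steps=4)
naive_batches = [windows[i:i + 2] for i in range(0, len(windows), 2)]

print("proper:", batches[0][0], "->", batches[1][0])
print("naive: ", naive_batches[0][0], "->", naive_batches[1][0])
```

In the proper batches, the first item of row `n` in one batch equals the last item (the final target) of row `n` in the previous batch, so the inputs continue without a gap. In the naive batches, row 0 jumps ahead by a whole batch of windows.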

In [11]:
batch_size = 32

# We need to set `stateful=True` in every recurrent layer. A stateful RNN also
# needs to know the batch size (since it preserves a state for each input
# sequence in the batch), so we must set `batch_input_shape` in the first layer.
# We can leave the second dimension unspecified, since inputs can have any length.

# `recurrent_dropout=0.2` is commented out so that we can use GPU acceleration.

model = keras.models.Sequential([
    keras.layers.GRU(128, return_sequences=True, stateful=True, dropout=0.2,
                     # recurrent_dropout=0.2,
                     batch_input_shape=[batch_size, None, max_id]),
    keras.layers.GRU(128, return_sequences=True, stateful=True, dropout=0.2,
                     # recurrent_dropout=0.2
                     ),
    keras.layers.TimeDistributed(keras.layers.Dense(max_id,
                                                    activation="softmax"))
])

At the end of each epoch, we need to reset the states before we go back to the beginning of the text. For this, we can use a small callback:

In [12]:
class ResetStatesCallback(keras.callbacks.Callback):
    def on_epoch_begin(self, epoch, logs=None):
        # reset the recurrent layers' states at the start of every epoch
        self.model.reset_states()

And now we can compile and fit the model (for more epochs, because each epoch is much shorter than earlier: the windows no longer overlap, so there are far fewer batches per epoch):

In [13]:
early_stopping_cb = keras.callbacks.EarlyStopping(monitor="loss", patience=5)
model_checkpoint_cb = keras.callbacks.ModelCheckpoint(
    "char-rnn_stateful_model.keras", monitor="loss",save_best_only=True)

model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
h = model.fit(dataset, epochs=100,
                    callbacks=[ResetStatesCallback(), early_stopping_cb,
                               model_checkpoint_cb])

Epoch 1/100
...
Epoch 78

Once we have trained the model, it will only be possible to use it to make predictions for batches of the same size as those used during training. To avoid this restriction, we need to create an identical *stateless* model, and copy the stateful model's weights to this model.

In [15]:
# we can get rid of dropout since it is used only during training

stateless_model = keras.models.Sequential([
    keras.layers.GRU(128, return_sequences=True, input_shape=[None, max_id]),
    keras.layers.GRU(128, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(max_id,
                                                    activation="softmax"))
])

To set the weights, we first need to build the model (so the weights get created):

In [16]:
stateless_model.build(tf.TensorShape([None, None, max_id]))

In [17]:
stateless_model.set_weights(model.get_weights())
model = stateless_model

In [22]:
tf.random.set_seed(42)

print(complete_text("t"))

tity oud,
dreamsof a his vows want beaugh thon to m


Now that we have built a character-level model, it's time to look at word-level models and tackle a common NLP task: *sentiment analysis*.

## Sentiment Analysis

We will work with the IMDb reviews dataset. It consists of 50,000 movie reviews (25,000 for training and 25,000 for testing), along with a simple binary target for each review indicating whether it is negative (0) or positive (1).

In [3]:
# loading dataset
(X_train, y_train), (X_test, y_test) = keras.datasets.imdb.load_data()

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
17464789/17464789 ━━━━━━━━━━━━━━━━━━━━ 22s 1us/step


`X_train` is a sequence of reviews, and each review is a list of integers:

In [8]:
X_train[0][:10]

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65]

Where are the movie reviews? Well, the dataset is already preprocessed for us: `X_train` consists of a list of reviews, each of which is represented as a list of integers, where each integer represents a word. All punctuation was removed, and then the words were converted to lowercase, split by spaces, and finally indexed by frequency (so low integers correspond to frequent words). The integers 0, 1 and 2 are special: they represent the padding token, the *start-of-sequence* (SOS) token and unknown words, respectively.

To visualize the review, we can decode it like this:

In [10]:
word_index = keras.datasets.imdb.get_word_index()

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json
1641221/1641221 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step


In [16]:
# Print the first few items of a dictionary
def print_dict(mapping, n_items=5):
    count = 0
    for key, value in mapping.items():
        print(key, ":", value)
        count += 1
        if count >= n_items:  # stop once n_items entries have been printed
            break

In [17]:
print_dict(word_index)

fawn : 34701
tsukino : 52006
nunnery : 52007
sonja : 16816
vani : 63951


In [14]:
id_to_word = {id_ + 3: word for word, id_ in word_index.items()} # shift every ID by 3, since IDs 0, 1 and 2 are reserved for the special tokens

In [18]:
print_dict(id_to_word)

34704 : fawn
52009 : tsukino
52010 : nunnery
16819 : sonja
63954 : vani


In [20]:
# add the three special tokens to the id_to_word dict
for id_, token in enumerate(("<pad>", "<sos>", "<unk>")):
    id_to_word[id_] = token

In [22]:
[id_to_word[id_] for id_ in X_train[0][:10]]

['<sos>',
 'this',
 'film',
 'was',
 'just',
 'brilliant',
 'casting',
 'location',
 'scenery',
 'story']

In [24]:
" ".join([id_to_word[id_] for id_ in X_train[0][:10]])

'<sos> this film was just brilliant casting location scenery story'
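Going the other way, from text to IDs, follows the same convention. The sketch below uses a tiny hypothetical `toy_word_index` rather than the real IMDb index, and `encode_review` and `decode_review` are illustrative helpers: known words are shifted by 3, unknown words map to 2 (`<unk>`), and every review starts with 1 (`<sos>`).

```python
SPECIAL_TOKENS = ("<pad>", "<sos>", "<unk>")  # IDs 0, 1 and 2
OFFSET = len(SPECIAL_TOKENS)

# Hypothetical miniature word index (word -> frequency rank, starting at 1)
toy_word_index = {"the": 1, "film": 2, "was": 3, "brilliant": 4}


def encode_review(text, word_index):
    """Encode text as IDs: <sos> first, known words shifted by 3, others <unk>."""
    ids = [1]  # <sos>
    for word in text.lower().split():
        ids.append(word_index[word] + OFFSET if word in word_index else 2)
    return ids


def decode_review(ids, word_index):
    """Map IDs back to words, including the three special tokens."""
    id_to_word = {id_ + OFFSET: word for word, id_ in word_index.items()}
    id_to_word.update(enumerate(SPECIAL_TOKENS))
    return " ".join(id_to_word[id_] for id_ in ids)


review_ids = encode_review("The film was truly brilliant", toy_word_index)
review_text = decode_review(review_ids, toy_word_index)
print(review_ids)
print(review_text)
```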

In a real project, we will have to preprocess the text ourselves. We can do that using the `Tokenizer` class we used earlier, but this time setting `char_level=False` (which is the default). When encoding words, it filters out a lot of characters, including most punctuation, line breaks and tabs (but we can change this using the `filters` argument). This is OK for English and many other scripts that use spaces between words, but not all scripts use spaces this way. Even in English, spaces are not always the best way to tokenize text: think of "San Francisco" or "#ILoveDeepLearning".

Fortunately, there are better options.
> Refer notes for better options

If we want to deploy our model to mobile devices or a web browser, and we don't want to have to write a different preprocessing function every time, then we will want to handle preprocessing using only TensorFlow operations, so it can be included in the model itself. Let's see how.

Let's load the original IMDb reviews as text (byte strings), using TensorFlow Datasets:

In [25]:
import tensorflow_datasets as tfds

datasets, info = tfds.load("imdb_reviews", as_supervised=True, with_info=True)
train_size = info.splits["train"].num_examples

Next, let's write the preprocessing function:

In [28]:
def preprocess(X_batch, y_batch):
    X_batch = tf.strings.substr(X_batch, 0, 300)
    X_batch = tf.strings.regex_replace(X_batch, b"<br\\s*/?>", b" ")
    X_batch = tf.strings.regex_replace(X_batch, b"[^a-zA-Z']", b" ")
    X_batch = tf.strings.split(X_batch)
    return X_batch.to_tensor(default_value=b"<pad>"), y_batch

It starts by truncating the reviews, keeping only the first 300 characters of each: this will speed up training, and it won't impact performance too much because we can generally tell whether a review is positive or not in the first sentence or two. Then it uses *regular expressions* to replace `<br />` tags with spaces, and to replace any characters other than letters and quotes with spaces. For example, the text `"Well, I can't<br />"` will become `"Well I can't"` (plus some extra spaces, which disappear when the text is split). Finally, the `preprocess()` function splits the reviews by spaces, which returns a ragged tensor, and it converts this ragged tensor to a dense tensor, padding all reviews with the padding token `"<pad>"` so that they all have the same length.
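The same steps can be traced with the standard `re` module. This is a rough pure-Python analogue for illustration, not a replacement for the TensorFlow version: `preprocess_text` and `pad_batch` are hypothetical helpers, they work on `str` rather than byte strings, and they return lists instead of tensors.

```python
import re


def preprocess_text(review, max_chars=300):
    """Truncate, strip <br /> tags and non-letter characters, then split into words."""
    review = review[:max_chars]
    review = re.sub(r"<br\s*/?>", " ", review)
    review = re.sub(r"[^a-zA-Z']", " ", review)
    return review.split()


def pad_batch(token_lists, pad_token="<pad>"):
    """Pad every token list to the length of the longest one in the batch."""
    max_length = max(len(tokens) for tokens in token_lists)
    return [tokens + [pad_token] * (max_length - len(tokens))
            for tokens in token_lists]


reviews = ["Well, I can't<br />", "Great movie!!! 10/10"]  # made-up examples
padded = pad_batch([preprocess_text(review) for review in reviews])
for tokens in padded:
    print(tokens)
```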

Next, we need to construct the vocabulary. This requires going through the whole training set once, applying our `preprocess()` function, and using a `Counter` to count the number of occurrences of each word:

In [33]:
from collections import Counter

vocabulary = Counter()
for X_batch, y_batch in datasets["train"].batch(32).map(preprocess):
    for review in X_batch:
        vocabulary.update(list(review.numpy()))



Let's look at the three most common words:

In [34]:
vocabulary.most_common()[:3]

[(b'<pad>', 214309), (b'the', 61137), (b'a', 38564)]

Great! We probably don't need our model to know all the words in the dictionary to get good performance, though, so let's truncate the vocabulary, keeping only the 10,000 most common words:

In [35]:
vocab_size = 10000
truncated_vocabulary = [word for word, count in vocabulary.most_common()[:vocab_size]]

Now we need to add a preprocessing step to replace each word with its ID (i.e., its index in the vocabulary). Just like we did in Chapter 13, we will create a lookup table for this, using 1,000 out-of-vocabulary (oov) buckets:

In [36]:
words = tf.constant(truncated_vocabulary)
word_ids = tf.range(len(truncated_vocabulary), dtype=tf.int64)
vocab_init = tf.lookup.KeyValueTensorInitializer(words, word_ids)
num_oov_buckets = 1000
table = tf.lookup.StaticVocabularyTable(vocab_init, num_oov_buckets)

We can then use this table to look up the IDs of a few words:

In [37]:
table.lookup(tf.constant([b"This movie was faaaaantastic".split()]))

<tf.Tensor: shape=(1, 4), dtype=int64, numpy=array([[   22,    12,    11, 10791]])>

Note that the words "This", "movie" and "was" were found in the table, so their IDs are lower than 10,000, while the word "faaaaantastic" was not found, so it was mapped to one of the oov buckets, with an ID greater than or equal to 10,000.
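The bucket logic can be sketched in a few lines of plain Python. This only illustrates the idea: `build_lookup` is a hypothetical helper, and it uses `zlib.crc32()` as the hash function, which is an assumption made for the example and will not reproduce the bucket IDs that `StaticVocabularyTable` computes.

```python
import zlib


def build_lookup(vocabulary, num_oov_buckets):
    """Return a function mapping known words to their index, others to a hashed bucket."""
    word_to_id = {word: index for index, word in enumerate(vocabulary)}

    def lookup(word):
        if word in word_to_id:
            return word_to_id[word]
        # Unknown word: hash it into one of the buckets placed after the vocabulary
        bucket = zlib.crc32(word.encode("utf-8")) % num_oov_buckets
        return len(vocabulary) + bucket

    return lookup


toy_vocabulary = ["<pad>", "the", "a", "movie"]  # hypothetical tiny vocabulary
lookup = build_lookup(toy_vocabulary, num_oov_buckets=10)
word_ids = [lookup(word) for word in ["the", "movie", "faaaaantastic"]]
print(word_ids)
```

Hashing is deterministic, so the same unknown word always lands in the same bucket, although different unknown words may collide.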

**TIP:**

TF Transform (introduced in Chapter 13) provides some useful functions to handle
such vocabularies. For example, check out the
`tft.compute_and_apply_vocabulary()` function: it will go through the dataset to
find all distinct words and build the vocabulary, and it will generate the TF
operations required to encode each word using this vocabulary.

Now we are ready to create the final training set. We batch the reviews, then convert them to short sequences of words using the `preprocess()` function, then encode these words using a simple `encode_words()` function that uses the table we just built, and finally prefetch the next batch:

In [39]:
def encode_words(X_batch, y_batch):
    return table.lookup(X_batch), y_batch

train_set = datasets["train"].batch(32).map(preprocess)
train_set = train_set.map(encode_words).prefetch(1)

At last we can create the model and train it:

In [41]:
embed_size = 128

model = keras.models.Sequential([
    keras.layers.Embedding(vocab_size + num_oov_buckets, embed_size, input_shape=[None]),
    keras.layers.GRU(128, return_sequences=True),
    keras.layers.GRU(128),
    keras.layers.Dense(1, activation="sigmoid")
])

model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
h = model.fit(train_set, epochs=1)

782/782 ━━━━━━━━━━━━━━━━━━━━ 92s 114ms/step - accuracy: 0.5451 - loss: 0.6740


> Will be trained on GPU

The first layer is the `Embedding` layer, which will convert word IDs into embeddings. The embedding matrix needs to have one row per word ID (`vocab_size + num_oov_buckets`) and one column per embedding dimension (here we have used 128 dimensions, but this is a hyperparameter that we can tune). Whereas the inputs of the model will be 2D tensors of shape [*batch size*, *time steps*], the output of the `Embedding` layer will be a 3D tensor of shape [*batch size*, *time steps*, *embedding size*].
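The shape change is easy to see with nested lists. In this sketch, `make_embedding_matrix` and `embed` are hypothetical helpers, the matrix is filled with random values rather than trained ones, and the sizes are deliberately tiny: an embedding lookup just replaces each word ID with the corresponding row of the matrix.

```python
import random


def make_embedding_matrix(num_ids, embed_size, seed=42):
    """One row per word ID, one column per embedding dimension (random values)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(embed_size)]
            for _ in range(num_ids)]


def embed(id_batch, matrix):
    """Replace every word ID with its row of the embedding matrix."""
    return [[matrix[word_id] for word_id in sequence] for sequence in id_batch]


matrix = make_embedding_matrix(num_ids=10, embed_size=4)
id_batch = [[3, 7, 0], [7, 0, 0]]  # shape [batch size=2, time steps=3]
embedded = embed(id_batch, matrix)
print(len(embedded), len(embedded[0]), len(embedded[0][0]))
```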

### Masking

As it stands, the model will need to learn that the padding tokens should be ignored. But we already know that! Why don't we tell the model to ignore the padding tokens, so that it can focus on the data that actually matters? It's actually trivial: simply add `mask_zero=True` when creating the `Embedding` layer. This means that padding tokens (whose ID is 0) will be ignored by all downstream layers.

> It is a good idea to give the padding token the ID 0. Here `<pad>` happens to be the most frequent token, so it already has ID 0; if it were not, we would have to make sure it is assigned ID 0 ourselves.

That's all!

In [None]:
embed_size = 128

model = keras.models.Sequential([
    keras.layers.Embedding(vocab_size + num_oov_buckets, embed_size, mask_zero=True, input_shape=[None]),
    keras.layers.GRU(128, return_sequences=True),
    keras.layers.GRU(128),
    keras.layers.Dense(1, activation="sigmoid")
])

model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
h = model.fit(train_set, epochs=1)

Working:

The `Embedding` layer creates a *mask tensor* equal to `K.not_equal(inputs, 0)` (where `K = keras.backend`): it is a Boolean tensor with the same shape as the inputs, and it is equal to `False` anywhere the word IDs are 0, or `True` otherwise. This mask tensor is then automatically propagated by the model to all subsequent layers, as long as the time dimension is preserved. So in this example, both `GRU` layers will receive the mask automatically, but since the second `GRU` layer does not return sequences (it only returns the output of the last time step), the mask will not be transmitted to the `Dense` layer. Each layer may handle the mask differently, but in general they simply ignore the masked time steps (i.e., time steps for which the mask is `False`). For example, when a recurrent layer encounters a masked time step, it simply copies the output from the previous time step.
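Both ideas, computing the mask and skipping masked steps, can be illustrated without Keras. In this sketch `compute_mask` mirrors the `not_equal(inputs, 0)` rule on nested lists, and `masked_running_sum` is a deliberately simple stand-in for a recurrent layer (its "state" is just a running sum), so it only demonstrates the copy-the-previous-output behavior, not a real GRU.

```python
def compute_mask(id_batch):
    """True where the word ID is nonzero, False where it is padding (ID 0)."""
    return [[word_id != 0 for word_id in sequence] for sequence in id_batch]


def masked_running_sum(values, mask):
    """Toy recurrent layer: accumulate values, skipping masked time steps."""
    state, outputs = 0, []
    for value, keep in zip(values, mask):
        if keep:
            state = state + value
        # On a masked step the state is untouched, so the previous output is repeated
        outputs.append(state)
    return outputs


id_batch = [[22, 12, 11, 0, 0]]
mask = compute_mask(id_batch)
outputs = masked_running_sum([5, 3, 2, 9, 9], mask[0])
print(mask[0])
print(outputs)
```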

**Warning:**

The LSTM and GRU layers have an optimized implementation for GPUs, based on
Nvidia’s cuDNN library. However, this implementation does not support masking. If
your model uses a mask, then these layers will fall back to the (much slower) default
implementation.


All the layers that receive the mask must support masking (or else an exception will be raised). This includes all recurrent layers, as well as the `TimeDistributed` layer and a few other layers. Any layer that supports masking must have a `supports_masking` attribute equal to `True`.

If we want to implement our own custom layer with masking support, we should add a `mask` argument to the `call()` method. Additionally, we should set `self.supports_masking = True` in the constructor. If the model does not start with an `Embedding` layer, we may use the `keras.layers.Masking` layer instead: it sets the mask to `K.any(K.not_equal(inputs, 0), axis=-1)`, meaning that time steps where the last dimension is full of zeros will be masked out in subsequent layers (again, as long as the time dimension exists).
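
As a quick check of that last formula, here is the same computation written with NumPy instead of the Keras backend (an equivalent sketch for illustration only; the toy `inputs` array is made up):

```python
import numpy as np

# Toy inputs of shape [batch size, time steps, features].
# Time steps whose features are all zero are treated as padding.
inputs = np.array([[[0.5, 0.1], [0.0, 0.3], [0.0, 0.0]],
                   [[1.2, 0.0], [0.0, 0.0], [0.0, 0.0]]])

# A time step is kept if at least one of its features is nonzero.
mask = np.any(np.not_equal(inputs, 0), axis=-1)
print(mask)
```

Note that the step `[0.0, 0.3]` is kept: only steps where every feature is zero are masked out.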

Using masking layers and automatic mask propagation works best for simple `Sequential` models. It will not always work for more complex models, such as when you need to mix `Conv1D` layers with recurrent layers. In such cases, you will need to explicitly compute the mask and pass it to the appropriate layers, using either the Functional API or the Subclassing API. For example, the following model is identical to the previous model, except it is built using the Functional API and handles masking manually:

In [None]:
K = keras.backend
inputs = keras.layers.Input(shape=[None])
mask = keras.layers.Lambda(lambda inputs: K.not_equal(inputs, 0))(inputs)
z = keras.layers.Embedding(vocab_size + num_oov_buckets, embed_size)(inputs)
z = keras.layers.GRU(128, return_sequences=True)(z, mask=mask)
z = keras.layers.GRU(128)(z, mask=mask)
outputs = keras.layers.Dense(1, activation="sigmoid")(z)
model = keras.Model(inputs=[inputs], outputs=[outputs])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
history = model.fit(train_set, epochs=5)

After training for a few epochs, this model will become quite good at judging whether a review is positive or not.

### Reusing Pretrained Embeddings

The TensorFlow Hub project makes it easy to reuse pretrained model components in our own models. These model components are called *modules*. Simply browse the [TF Hub repository](https://tfhub.dev/), find the one we need, and copy the code example into our project; the module will be automatically downloaded, along with its pretrained weights. Easy!

For example, let's use the `nnlm-en-dim50` sentence embedding module, version 1, in our sentiment analysis model:

In [None]:
import tensorflow_hub as hub

model = keras.Sequential([
    hub.KerasLayer("https://tfhub.dev/google/tf2-preview/nnlm-en-dim50/1", dtype=tf.string,
                  input_shape=[], output_shape=[50]),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid")
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

The `hub.KerasLayer` layer downloads the module from the given URL. This particular module is a *sentence encoder*: it takes strings as input and encodes each one as a single vector (in this case, a 50-dimensional vector). Internally, it parses the string (splitting words on spaces) and embeds each word using an embedding matrix that was pretrained on a huge corpus: the Google News 7B corpus. Then it computes the mean of all the word embeddings, and the result is the sentence embedding. We can then add two simple `Dense` layers to create a good sentiment analysis model. By default, a `hub.KerasLayer` is not trainable, but we can set `trainable=True` when creating it so that we can fine-tune it for our task.
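
The "split, embed, average" idea can be sketched in a few lines of NumPy. This is only an illustration of the principle: `toy_vocab`, the random `embedding_matrix`, and `encode_sentence` are hypothetical, and the real module's vocabulary, preprocessing, and out-of-vocabulary handling may differ.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical tiny vocabulary; the last row of the matrix is reserved
# for out-of-vocabulary words.
toy_vocab = {"great": 0, "movie": 1, "terrible": 2}
oov_id = len(toy_vocab)
embedding_matrix = rng.normal(size=(len(toy_vocab) + 1, 50))

def encode_sentence(sentence):
    """Split on spaces, look up each word, and average the word vectors."""
    words = sentence.split(" ")
    ids = [toy_vocab.get(word, oov_id) for word in words]
    return embedding_matrix[ids].mean(axis=0)

sentence_vector = encode_sentence("great movie")
print(sentence_vector.shape)
```

Averaging discards word order, which is one reason such an encoder is fast and simple but less expressive than a recurrent model.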

Next, we can just load the IMDb reviews dataset - no need to preprocess it (except for batching and prefetching) - and directly train the model:

In [None]:
datasets, info = tfds.load("imdb_reviews", as_supervised=True, with_info=True)
train_size = info.splits["train"].num_examples
batch_size=32
train_set = datasets["train"].batch(batch_size).prefetch(1)
h = model.fit(train_set, epochs=5)

By default, TF Hub will cache the downloaded files into
the local system’s temporary directory. You may prefer to download them
into a more permanent directory to avoid having to download them again
after every system cleanup. To do that, set the `TFHUB_CACHE_DIR`
environment variable to the directory of your choice (e.g.,
`os.environ["TFHUB_CACHE_DIR"] = "./my_tfhub_cache"`).

So far, we have looked at time series, text generation using Char-RNN, and sentiment analysis using word-level RNN models, training our own word embeddings or using pretrained embeddings. 

Let's now look at another important NLP task: *neural machine translation* (NMT), first using a pure Encoder-Decoder model, then improving it with attention mechanisms, and finally looking at the extraordinary Transformer architecture.

## An Encoder-Decoder Model for Neural Machine Translation

The TensorFlow Addons project includes many sequence-to-sequence tools to let us easily build production-ready Encoder-Decoders. For example, the following code creates a basic Encoder-Decoder model, similar to the one represented in the figure:
> Figure drawn in notes

In [None]:
import tensorflow_addons as tfa

vocab_size = 100
embed_size = 10

encoder_inputs = keras.layers.Input(shape=[None], dtype=np.int32)
decoder_inputs = keras.layers.Input(shape=[None], dtype=np.int32)
sequence_lengths = keras.layers.Input(shape=[], dtype=np.int32)

embeddings = keras.layers.Embedding(vocab_size, embed_size)
encoder_embeddings = embeddings(encoder_inputs)
decoder_embeddings = embeddings(decoder_inputs)

encoder = keras.layers.LSTM(512, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_embeddings)
encoder_state = [state_h, state_c]

sampler = tfa.seq2seq.sampler.TrainingSampler()

decoder_cell = keras.layers.LSTMCell(512)
output_layer = keras.layers.Dense(vocab_size)
decoder = tfa.seq2seq.basic_decoder.BasicDecoder(decoder_cell, sampler, output_layer=output_layer)
final_outputs, final_state, final_sequence_lengths = decoder(decoder_embeddings, 
                                                             initial_state=encoder_state,
                                                            sequence_length=sequence_lengths)
Y_proba = tf.nn.softmax(final_outputs.rnn_output)

model = keras.Model(inputs=[encoder_inputs, decoder_inputs, sequence_lengths], outputs=[Y_proba])

Explanation:

First, we set `return_state=True` when creating the `LSTM` layer so that we can get its final hidden state and pass it to the decoder. Since we are using an LSTM, it actually returns two hidden states (short term and long term). The `TrainingSampler` is one of several samplers available in TensorFlow Addons: their role is to tell the decoder at each step what it should pretend the previous output was. During inference, this should be the embedding of the token that was actually output. During training, it should be the embedding of the previous target token: this is why we used the `TrainingSampler`. In practice, it is often a good idea to start training with the embedding of the target of the previous time step and gradually transition to using the embedding of the actual token that was output at the previous step. This idea was introduced in a 2015 paper by Samy Bengio et al. The `ScheduledEmbeddingTrainingSampler` will randomly choose between the target and the actual output, with a probability that we can gradually change during training.
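
The random choice between the target and the actual output can be sketched as follows. This is a minimal NumPy illustration of the idea, not the code of `ScheduledEmbeddingTrainingSampler`: the function `choose_previous_tokens` and the token IDs are hypothetical, and it works on token IDs rather than embeddings to keep the example short.

```python
import numpy as np

def choose_previous_tokens(target_tokens, predicted_tokens, sampling_probability, rng):
    """Per position, feed the model's own prediction with the given
    probability, and the ground-truth target otherwise."""
    use_prediction = rng.random(target_tokens.shape) < sampling_probability
    return np.where(use_prediction, predicted_tokens, target_tokens)

rng = np.random.default_rng(0)
targets = np.array([4, 9, 2, 7, 5])      # previous target tokens
predictions = np.array([4, 3, 2, 1, 5])  # tokens the decoder actually output

# Early in training: always feed the targets (pure teacher forcing).
early = choose_previous_tokens(targets, predictions, 0.0, rng)
# Late in training: always feed the model's own outputs, as at inference time.
late = choose_previous_tokens(targets, predictions, 1.0, rng)
# In between: a random mix of the two.
mixed = choose_previous_tokens(targets, predictions, 0.5, rng)
print(early, late, mixed)
```

Raising the probability gradually during training exposes the decoder to its own mistakes little by little, which narrows the gap between training and inference.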

### Bidirectional RNNs

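At each time step, a regular recurrent layer only looks at past and present inputs before generating its output. For tasks such as NMT, it is often preferable to look ahead at the next words before encoding a given word. To implement this, run two recurrent layers on the same inputs, one reading the words from left to right and the other reading them from right to left. Then simply combine their outputs at each time step, typically by concatenating them. This is called a *bidirectional recurrent layer*.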

To implement a bidirectional recurrent layer in Keras, wrap a recurrent layer in a `keras.layers.Bidirectional` layer:

In [None]:
keras.layers.Bidirectional(keras.layers.GRU(10, return_sequences=True))

**NOTE:**

The `Bidirectional` layer will create a clone of the `GRU` layer (but in the reverse direction), and it will run both and concatenate their outputs. So although the `GRU` layer has 10 units, the `Bidirectional` layer will output 20 values per time step.
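
To see where the 20 values come from, here is a toy NumPy sketch of the same idea. It uses a hypothetical, very simple tanh RNN (`simple_rnn`) with random weights instead of a real `GRU`, so it only illustrates the "run both directions, then concatenate" mechanics.

```python
import numpy as np

def simple_rnn(inputs, Wx, Wh):
    """Minimal tanh RNN that returns the output at every time step."""
    batch_size, time_steps, _ = inputs.shape
    h = np.zeros((batch_size, Wh.shape[0]))
    outputs = []
    for t in range(time_steps):
        h = np.tanh(inputs[:, t] @ Wx + h @ Wh)
        outputs.append(h)
    return np.stack(outputs, axis=1)

def bidirectional(inputs, forward_weights, backward_weights):
    """Run one RNN left to right and another right to left, then concatenate."""
    forward = simple_rnn(inputs, *forward_weights)
    # Reverse time for the backward pass, then flip its outputs back so
    # that both directions are aligned on the same time steps.
    backward = simple_rnn(inputs[:, ::-1], *backward_weights)[:, ::-1]
    return np.concatenate([forward, backward], axis=-1)

rng = np.random.default_rng(0)
n_features, n_units = 3, 10
forward_weights = (rng.normal(size=(n_features, n_units)),
                   rng.normal(size=(n_units, n_units)))
backward_weights = (rng.normal(size=(n_features, n_units)),
                    rng.normal(size=(n_units, n_units)))

inputs = rng.normal(size=(2, 7, n_features))  # [batch size, time steps, features]
outputs = bidirectional(inputs, forward_weights, backward_weights)
print(outputs.shape)
```

Each direction has its own weights, so the wrapped layer has twice as many parameters as the original one.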

### Beam Search

We can implement beam search fairly easily using TensorFlow Addons:

In [None]:
beam_width = 10

decoder = tfa.seq2seq.beam_search_decoder.BeamSearchDecoder(cell=decoder_cell, beam_width=beam_width,
                                                           output_layer=output_layer)
decoder_initial_state = tfa.seq2seq.beam_search_decoder.tile_batch(encoder_state, 
                                                                  multiplier=beam_width)
outputs, _, _ = decoder(embedding_decoder, start_tokens=start_tokens, end_token=end_token,
                       initial_state=decoder_initial_state)

We first create a `BeamSearchDecoder`, which wraps all decoder clones (in this case 10 clones). Then we create one copy of the encoder's final state for each decoder clone, and we pass these states to the decoder, along with start and end tokens.
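
For intuition about what the decoder clones are doing, here is a small pure-Python sketch of beam search over a made-up next-token table. Everything in it (`toy_table`, `toy_next_log_probs`, `beam_search`) is hypothetical and unrelated to the TensorFlow Addons implementation; a real decoder would get its probabilities from the model rather than from a lookup table.

```python
import math

# Hypothetical next-token probabilities, keyed by the last token only.
toy_table = {
    "<sos>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 0.9, "w": 0.1},
    "x": {"<eos>": 1.0},
    "y": {"<eos>": 1.0},
    "z": {"<eos>": 1.0},
    "w": {"<eos>": 1.0},
}

def toy_next_log_probs(tokens):
    """Log-probability of each possible next token given the sequence so far."""
    return {tok: math.log(p) for tok, p in toy_table[tokens[-1]].items()}

def beam_search(next_log_probs, start_token, end_token, beam_width, max_steps):
    """Keep the beam_width most likely partial sequences at every step."""
    beams = [([start_token], 0.0)]  # (tokens, summed log-probability)
    for _ in range(max_steps):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == end_token:
                candidates.append((tokens, score))  # finished: keep as is
                continue
            for token, log_p in next_log_probs(tokens).items():
                candidates.append((tokens + [token], score + log_p))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
        if all(tokens[-1] == end_token for tokens, _ in beams):
            break
    return beams

greedy = beam_search(toy_next_log_probs, "<sos>", "<eos>", beam_width=1, max_steps=5)
wider = beam_search(toy_next_log_probs, "<sos>", "<eos>", beam_width=2, max_steps=5)
print(greedy[0])
print(wider[0])
```

With a beam width of 1 the search is greedy: it commits to "a" (the most likely first token) and ends up with a less likely sentence overall, whereas a beam width of 2 keeps "b" alive long enough to find the better sequence.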

With all this, we can get good translations for fairly short sentences (especially if we use pretrained word embeddings). Unfortunately, this model will be really bad at translating long sentences. *Attention mechanisms* are the game-changing innovation that addressed this problem.

## Attention Mechanisms

#### Bahdanau Attention

#### Luong Attention

Here's how we can add Luong attention to an Encoder-Decoder model using TensorFlow Addons:

In [None]:
attention_mechanism = tfa.seq2seq.attention_wrapper.LuongAttention(units, encoder_state,
                                      memory_sequence_length=encoder_sequence_length)
attention_decoder_cell = tfa.seq2seq.attention_wrapper.AttentionWrapper(
                decoder_cell, attention_mechanism, attention_layer_size=n_units)

We simply wrap the decoder cell in an `AttentionWrapper`, and we provide the desired attention mechanism.
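
To give a rough idea of what such a mechanism computes at each decoder step, here is a NumPy sketch of plain dot-product attention, the simplest multiplicative scoring. It is an illustration under assumptions, not the TensorFlow Addons implementation: `dot_product_attention` is a hypothetical helper, the tensors are random, and the real `LuongAttention` additionally applies a trainable layer to the memory.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_attention(decoder_state, encoder_outputs, memory_mask=None):
    """decoder_state: [batch size, units]
    encoder_outputs: [batch size, time steps, units]
    memory_mask: optional Boolean [batch size, time steps], False on padding."""
    # One score per encoder time step: dot product with the decoder state.
    scores = np.einsum("bu,btu->bt", decoder_state, encoder_outputs)
    if memory_mask is not None:
        scores = np.where(memory_mask, scores, -1e9)  # ignore padded steps
    weights = softmax(scores, axis=-1)
    # Context vector: weighted sum of the encoder outputs.
    context = np.einsum("bt,btu->bu", weights, encoder_outputs)
    return context, weights

rng = np.random.default_rng(0)
encoder_outputs = rng.normal(size=(2, 5, 8))
decoder_state = rng.normal(size=(2, 8))
memory_mask = np.array([[True, True, True, False, False],
                        [True, True, True, True, True]])

context, weights = dot_product_attention(decoder_state, encoder_outputs, memory_mask)
print(context.shape, weights.round(3))
```

The mask plays the same role as the memory sequence length in the code above: padded encoder steps get a weight of (almost exactly) zero.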

### Visual Attention