# Natural Language Processing with RNNs and Attention
This notebook looks at how RNNs can be used for natural language processing (NLP) and at the
techniques state-of-the-art algorithms use to succeed at this task.

## Index

[Generating Shakespearean Text Using a Character RNN](#Generating-Shakespearean-Text-Using-a-Character-RNN)

[Sentiment Analysis](#Sentiment-Analysis)

## Generating Shakespearean Text Using a Character RNN
In 2015 a blog post titled *The Unreasonable Effectiveness of Recurrent Neural Networks* showed
how to train a *Char-RNN* to predict the next character in a sentence. This Char-RNN can be used
to generate novel text, one character at a time.

### Creating the Training Dataset
All of Shakespeare's work can be downloaded from the
[blog's GitHub repository](https://github.com/karpathy/char-rnn) using Keras's handy
```get_file()``` function.

Next, every character needs to be encoded as an integer. For this we can use Keras's
```Tokenizer``` class. A tokenizer needs to be fit to the text: it will find all the characters
used in the text and map each of them to a different character ID, from 1 to the number of
distinct characters. Setting ```char_level=True``` gives character-level encoding instead of
the default word-level encoding. Note that the tokenizer converts the text to lowercase by
default. The tokenizer can encode a sentence to a list of character IDs and back, and it tells us
how many distinct characters there are and the total number of characters in the text.
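
To make the mapping concrete, here is a minimal plain-Python sketch of character-level encoding. It is not the Keras implementation: the helper names ```fit_char_vocab```, ```encode``` and ```decode``` are made up for illustration, and the sketch only assumes that IDs start at 1 and that more frequent characters get lower IDs.

```python
from collections import Counter

def fit_char_vocab(text):
    """Map each distinct character to an ID starting at 1, most frequent first."""
    counts = Counter(text.lower())  # lowercase first, like the Tokenizer default
    return {ch: i for i, (ch, _) in enumerate(counts.most_common(), start=1)}

def encode(text, char_to_id):
    """Convert a string to a list of character IDs."""
    return [char_to_id[ch] for ch in text.lower()]

def decode(ids, char_to_id):
    """Convert a list of character IDs back to a string."""
    id_to_char = {i: ch for ch, i in char_to_id.items()}
    return "".join(id_to_char[i] for i in ids)

sample = "First Citizen: speak, speak."
char_to_id = fit_char_vocab(sample)
ids = encode("First", char_to_id)
print(ids)
print(decode(ids, char_to_id))       # first
print(len(char_to_id), len(sample))  # distinct characters, total characters
```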

In [80]:
# Import modules
import numpy as np
import tensorflow as tf
from tensorflow import keras

In [81]:
# Downloading the dataset
shakespeare_url = "https://homl.info/shakespeare"
filepath = keras.utils.get_file("shakespeare.txt", shakespeare_url)

with open(filepath) as f:
    shakespeare_text = f.read()

print(shakespeare_text[:148])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?



In [82]:
# Tokenize
tokenizer = keras.preprocessing.text.Tokenizer(char_level=True)
tokenizer.fit_on_texts(shakespeare_text)

In [83]:
# Token EDA
print(tokenizer.texts_to_sequences(["First"]))
print(tokenizer.sequences_to_texts([[20, 6, 9, 8, 3]]))
max_id = len(tokenizer.word_index) # number of distinct characters
dataset_size = tokenizer.document_count # total number of characters

[[20, 6, 9, 8, 3]]
['f i r s t']


The tokenizer's IDs start at 1 rather than 0, which leaves the value 0 free for masking. The next
cell encodes the full text so that each character is represented by its ID. Subtracting 1 gives
IDs from 0 to 38, rather than from 1 to 39.


In [84]:
# Encoding
[encoded] = np.array(tokenizer.texts_to_sequences([shakespeare_text])) - 1

### Splitting the Sequential Dataset
It is very important to avoid any overlap between the training set, the validation set and the
test set. For example, we can take the first 90% of the text for the training set, the next 5%
for the validation set and the final 5% for the test set. It is also a good idea to leave a gap
between these sets to avoid the risk of a paragraph overlapping two sets.

Splitting across time implicitly assumes that the patterns the RNN can learn in the past
(training set) will still exist in the future. In other words, we assume that the data is
*stationary*. This assumption is valid for some datasets but not for others.

To make sure that the time series is indeed sufficiently stationary, the model's error on the
validation set can be plotted across time: if the model performs much better on the first part of
the validation set than on the last part, the time series may not be stationary enough, and
the model should be trained on a shorter time span.
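
As a rough illustration of this check, the sketch below averages per-example validation losses over consecutive segments so that the early and late parts can be compared. The function name ```mean_loss_by_segment``` and the loss values are hypothetical; they are not results from this notebook.

```python
def mean_loss_by_segment(losses, n_segments=4):
    """Average time-ordered losses over consecutive, equally sized segments."""
    seg_len = len(losses) // n_segments
    return [
        sum(losses[i * seg_len:(i + 1) * seg_len]) / seg_len
        for i in range(n_segments)
    ]

# Hypothetical validation losses, ordered by time: the error drifts upward.
val_losses = [0.9, 1.0, 1.1, 1.0, 1.2, 1.3, 1.5, 1.6]
segments = mean_loss_by_segment(val_losses, n_segments=2)
print(segments)

# A clearly higher error on the last segment hints at non-stationary data.
print(segments[-1] > segments[0])
```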

The first 90% of the Shakespearean dataset will be used for training. The rest is kept for the
validation and test sets.
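
The 90% / 5% / 5% split with a gap between the sets can be sketched as index arithmetic. The function ```split_with_gaps``` and the gap of 1,000 characters are assumptions chosen for illustration; the notebook itself only computes ```train_size```.

```python
def split_with_gaps(n_items, train_pct=90, valid_pct=5, gap=0):
    """Return (start, stop) index pairs for the train, validation and test sets."""
    train_end = n_items * train_pct // 100
    valid_start = train_end + gap              # skip `gap` items after training
    valid_end = valid_start + n_items * valid_pct // 100
    test_start = valid_end + gap               # skip `gap` items after validation
    return (0, train_end), (valid_start, valid_end), (test_start, n_items)

train, valid, test = split_with_gaps(1_000_000, gap=1_000)
print(train)  # (0, 900000)
print(valid)  # (901000, 951000)
print(test)   # (952000, 1000000)
```

The items inside each gap are simply discarded, which costs a little data but keeps text from one set out of its neighbor.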


In [85]:
# Training
train_size = dataset_size * 90 // 100
dataset = tf.data.Dataset.from_tensor_slices(encoded[:train_size])


### Chopping the Sequential Dataset into Multiple Windows
The training set now consists of a single sequence of over a million characters. The dataset's
```window()``` method can be used to convert this long sequence of characters into many
smaller windows of text. Every instance in the dataset will be a fairly short substring of the
whole text, and the RNN will be unrolled only over the length of these substrings. This is called
*truncated backpropagation through time*.

In [86]:
# Dataset window
n_steps = 100
window_length = n_steps + 1 # target = input shifted 1 character ahead
# No repeat(): fit() is called without steps_per_epoch, so the dataset must be finite
dataset = dataset.window(window_length, shift=1, drop_remainder=True)

```shift=1``` is used so that the first window contains characters 0 to 100, the second contains
characters 1 to 101, and so on. ```drop_remainder=True``` ensures that all windows are exactly
101 characters long.
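
The effect of ```shift``` and ```drop_remainder``` can be mimicked on a tiny list in plain Python. This is only a sketch of the windowing idea, not the ```tf.data``` implementation, and ```sliding_windows``` is a made-up helper name.

```python
def sliding_windows(sequence, window_length, shift=1, drop_remainder=True):
    """Yield windows of `window_length` items, each starting `shift` items later."""
    for start in range(0, len(sequence), shift):
        window = sequence[start:start + window_length]
        if drop_remainder and len(window) < window_length:
            break  # the remaining windows would all be too short
        yield window

windows = list(sliding_windows(list(range(6)), window_length=4, shift=1))
print(windows)  # [[0, 1, 2, 3], [1, 2, 3, 4], [2, 3, 4, 5]]

# Each window is later split into inputs and targets shifted one step ahead.
inputs, targets = windows[0][:-1], windows[0][1:]
print(inputs, targets)  # [0, 1, 2] [1, 2, 3]
```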

The ```window()``` method creates a dataset that contains windows, each of which is represented
as a dataset. It is a *nested dataset*. The model expects tensors as input and therefore we must
call the ```flat_map()``` method: it converts a nested dataset into a *flat dataset*. The
```flat_map()``` method takes a function as an argument, which allows us to transform each
dataset in the nested dataset before flattening. Here each window is batched with
```batch(window_length)```, which turns it into a single tensor of 101 characters.

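Since Gradient Descent works best when instances in the training set are independent and
identically distributed, we need to shuffle these windows. The windows are then batched, and the
inputs (all characters but the last) are separated from the targets (all characters but the
first).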

In [87]:
# flat_map()
dataset = dataset.flat_map(lambda window: window.batch(window_length))

# Seed
np.random.seed(42)
tf.random.set_seed(42)

In [88]:
# Batch
batch_size = 32
dataset = dataset.shuffle(10000).batch(batch_size)
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))

Categorical input features should generally be encoded as one-hot vectors or embeddings. One-hot
vectors are used here because there are only 39 distinct characters.
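
As a reminder of what one-hot encoding produces, here is a small plain-Python sketch. The ```one_hot``` helper below is written for illustration only and is not ```tf.one_hot()```.

```python
def one_hot(ids, depth):
    """Turn each integer ID into a vector of `depth` zeros with a single 1."""
    return [[1.0 if i == id_ else 0.0 for i in range(depth)] for id_ in ids]

print(one_hot([0, 2, 1], depth=4))
# [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
```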

In [89]:
# OHE
dataset = dataset.map(lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch))
dataset = dataset.prefetch(1)

In [90]:
# TF GPU
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))

Num GPUs Available:  1


### Building and Training the Char-RNN Model

In [92]:
#with tf.device("gpu:0"):
print("tf.keras code in this scope will run on GPU \n")
model = keras.models.Sequential([
    keras.layers.GRU(128, return_sequences=True, input_shape=[None, max_id], dropout=0.2,
                     recurrent_dropout=0.2),
    keras.layers.GRU(128, return_sequences=True, dropout=0.2, recurrent_dropout=0.2),
    keras.layers.TimeDistributed(keras.layers.Dense(max_id, activation="softmax"))
])

model.compile(loss="sparse_categorical_crossentropy", optimizer="Adam")
history = model.fit(dataset, epochs=5)

tf.keras code in this scope will run on GPU 

Epoch 1/5
   5502/Unknown - 1261s 229ms/step - loss: 1.6707

KeyboardInterrupt: 

### Using the Char-RNN Model
To predict the next character, the text fed in as input must first be preprocessed.

In [93]:
# Preprocess
def preprocess(texts):
    X = np.array(tokenizer.texts_to_sequences(texts)) - 1

    return tf.one_hot(X, max_id)

X_new = preprocess(["How are yo"])
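# predict_classes() is deprecated and missing from recent TensorFlow releases,
# so take the argmax of the predicted probabilities instead
Y_pred = np.argmax(model.predict(X_new), axis=-1)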

print(tokenizer.sequences_to_texts(Y_pred + 1)[0][-1]) # 1st sentence, last character

u


### Generating Fake Shakespearean Text
To generate new text using the Char-RNN model, we could feed it some text, make the model predict
the most likely next letter, add it at the end of the text, then give the extended text to
the model to guess the next letter, and so on. In practice, this often leads to the same words
being repeated over and over again.

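Instead, the ```tf.random.categorical()``` function can be used to generate more diverse and
interesting text. The ```categorical()``` function samples random class indices, given the class
log probabilities (logits). The logits can first be divided by a *temperature*: a value close to
0 favors the high-probability characters, while a high value gives all characters a more equal
chance.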
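
The effect of dividing the log probabilities by a temperature can be seen on a toy distribution. The sketch below uses only the standard library; ```apply_temperature``` and the three probabilities are made up for illustration. Sampling from logits divided by the temperature is equivalent to sampling from the rescaled distribution it returns.

```python
import math

def apply_temperature(probas, temperature=1.0):
    """Rescale log probabilities by 1 / temperature and renormalize (softmax)."""
    logits = [math.log(p) / temperature for p in probas]
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - max_logit) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]

probas = [0.6, 0.3, 0.1]  # toy next-character probabilities
for temperature in (0.2, 1.0, 2.0):
    rescaled = apply_temperature(probas, temperature)
    print(temperature, [round(p, 3) for p in rescaled])
```

A low temperature concentrates almost all of the probability on the most likely character, a temperature of 1 leaves the distribution unchanged, and a high temperature flattens it.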

In [103]:
# Random
def next_char(text, temperature=1):
    X_new = preprocess([text])
    y_proba = model.predict(X_new)[0, -1:, :]
    rescaled_logits = tf.math.log(y_proba) / temperature
    char_id = tf.random.categorical(rescaled_logits, num_samples=1) + 1
    return tokenizer.sequences_to_texts(char_id.numpy())[0]

In [104]:
def complete_text(text, n_chars=50, temperature=1):
    for _ in range(n_chars):
        text += next_char(text, temperature)

    return text

In [105]:
# Generate text
print(complete_text("t", temperature=0.2))

the country so so see thee,
and the son, and thou s


In [106]:
print(complete_text("t", temperature=1))


ter:
i droble consprint joy thou drance.

lady anne


In [107]:
print(complete_text("t", temperature=2))

t!--it?
my, go?
frepal thy in rull-your quard, me e


This is not the smartest model, and training can be improved to increase its predictive
capabilities: more ```GRU``` layers, more neurons per layer, more epochs and more regularization
may all help. The current model is also incapable of learning patterns longer than
```n_steps```, which is 100 characters. Making ```n_steps``` larger makes training harder.
Stateful RNNs can help with this.

### Stateful RNNs
So far this notebook has used *stateless* RNNs: at each training iteration the model starts with
a hidden state full of zeros, then it updates this state at each time step, and after the last
time step it throws the state away, as it is not needed anymore. We can instead tell the RNN to
preserve this final state after processing one training batch and use it as the initial state for
the next training batch. This way the model can learn long-term patterns despite only
backpropagating through short sequences. This is called a *stateful* RNN.

A stateful RNN only makes sense if each input sequence in a batch starts exactly where the
corresponding sequence in the previous batch left off. A sequential and nonoverlapping input
sequence must be used.
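
The requirement that each input sequence continues where the previous one stopped can be checked on a toy sequence. This plain-Python sketch assumes batches of a single window; ```sequential_windows``` is a hypothetical helper, not part of ```tf.data```.

```python
def sequential_windows(sequence, n_steps):
    """Windows of n_steps + 1 items, shifted by n_steps so inputs never overlap."""
    window_length = n_steps + 1
    return [
        sequence[start:start + window_length]
        for start in range(0, len(sequence) - window_length + 1, n_steps)
    ]

text_ids = list(range(10))
windows = sequential_windows(text_ids, n_steps=3)
batches = [(w[:-1], w[1:]) for w in windows]  # (inputs, targets), batch size 1
for inputs, targets in batches:
    print(inputs, targets)
# [0, 1, 2] [1, 2, 3]
# [3, 4, 5] [4, 5, 6]
# [6, 7, 8] [7, 8, 9]
```

Shifting by ```n_steps``` rather than by the full window length makes the inputs of one window start right after the inputs of the previous one, which is what lets the preserved state stay meaningful.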

In [108]:
# Sequential input
dataset = tf.data.Dataset.from_tensor_slices(encoded[:train_size])
dataset = dataset.window(window_length, shift=n_steps, drop_remainder=True)
dataset = dataset.flat_map(lambda window: window.batch(window_length))
dataset = dataset.batch(1)
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))
dataset = dataset.map(lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch))
dataset = dataset.prefetch(1)

In [109]:
model = keras.models.Sequential([
    # The batch size given here must match the dataset, which was batched with batch(1) above
    keras.layers.GRU(128, return_sequences=True, stateful=True, dropout=0.2,
                     recurrent_dropout=0.2, batch_input_shape=[1, None, max_id]),
    keras.layers.GRU(128, return_sequences=True, stateful=True, dropout=0.2, recurrent_dropout=0.2),
    keras.layers.TimeDistributed(keras.layers.Dense(max_id, activation="softmax"))
])

The states need to be reset at the start of each epoch, before the model goes back to the
beginning of the text. A small callback can do this.

In [110]:
class ResetStatesCallback(keras.callbacks.Callback):
    def on_epoch_begin(self, epoch, logs=None):
        self.model.reset_states()

In [111]:
# Model compile
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
model.fit(dataset, epochs=20, callbacks=[ResetStatesCallback()])


## Sentiment Analysis
This section loads the well-known IMDb movie review dataset and builds a sentiment classifier.


In [112]:
# Dataset
(X_train, y_train), (X_test, y_test) = keras.datasets.imdb.load_data()
X_train[0][:10]

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz




[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65]

In [116]:
# Decoding review
word_index = keras.datasets.imdb.get_word_index()
id_to_word = {id_ + 3: word for word, id_ in word_index.items()}
for id_, token in enumerate(("<pad>", "<sos>", "<unk>")):
    id_to_word[id_] = token
" ".join([id_to_word[id_] for id_ in X_train[0][:10]])

'<sos> this film was just brilliant casting location scenery story'

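The IMDb dataset loaded above is already preprocessed: each review is a list of word IDs. In a
real project with raw text, this preprocessing would have to be done by hand, as in the
following cells.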


In [121]:
# Load the raw IMDb reviews with TensorFlow Datasets
import tensorflow_datasets as tfds

datasets, info = tfds.load("imdb_reviews", as_supervised=True, with_info=True)
train_size = info.splits["train"].num_examples

In [134]:
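def preprocess(X_batch, y_batch):
    # Truncate each review to its first 300 bytes
    X_batch = tf.strings.substr(X_batch, 0, 300)
    # Replace HTML line breaks, then every character that is not a letter or a quote, with spaces
    X_batch = tf.strings.regex_replace(X_batch, rb"<br\s*/?>", b" ")
    X_batch = tf.strings.regex_replace(X_batch, b"[^a-zA-Z']", b" ")
    # Split on whitespace (ragged tensor), then pad to a dense tensor with "<pad>"
    X_batch = tf.strings.split(X_batch)
    return X_batch.to_tensor(default_value=b"<pad>"), y_batch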

Building the vocabulary involves going through the whole training set once, applying the
```preprocess()``` function and using a ```Counter``` to count the number of occurrences of each
word.
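
The same cleaning and counting steps can be tried on a couple of toy reviews in plain Python. This is a sketch under simplifying assumptions: ```preprocess_review``` and the two reviews are made up, the truncation works on characters rather than bytes, and no padding is applied.

```python
import re
from collections import Counter

def preprocess_review(review, max_chars=300):
    """Plain-Python counterpart of the TensorFlow string operations above."""
    review = review[:max_chars]                  # keep the start of the review
    review = re.sub(r"<br\s*/?>", " ", review)   # drop HTML line breaks
    review = re.sub(r"[^a-zA-Z']", " ", review)  # keep letters and quotes only
    return review.split()

toy_reviews = ["A great film!<br />Great cast.", "Not a great film."]
vocabulary = Counter()
for review in toy_reviews:
    vocabulary.update(preprocess_review(review))

print(preprocess_review(toy_reviews[0]))
print(vocabulary.most_common(2))
```

Because nothing is lowercased, "Great" and "great" are counted as two different words, as in the cells of this notebook.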

In [136]:
# Vocabulary
from collections import Counter

vocabulary = Counter()
for X_batch, y_batch in datasets["train"].batch(32).map(preprocess):
    for review in X_batch:
        vocabulary.update(list(review.numpy()))

In [137]:
# Most common
vocabulary.most_common()[:3]

[(b'<pad>', 214309), (b'the', 61137), (b'a', 38564)]

In [138]:
vocab_size = 10000
truncated_vocabulary = [word for word, count in vocabulary.most_common()[:vocab_size]]

In [139]:
words = tf.constant(truncated_vocabulary)
word_ids = tf.range(len(truncated_vocabulary), dtype=tf.int64)
vocab_init = tf.lookup.KeyValueTensorInitializer(words, word_ids)
num_oov_buckets = 1000
table = tf.lookup.StaticVocabularyTable(vocab_init, num_oov_buckets)

In [140]:
# Lookup
table.lookup(tf.constant([b"This move was faaaaaaaaaaaaaantastic".split()]))


<tf.Tensor: shape=(1, 4), dtype=int64, numpy=array([[   22,   951,    11, 10785]], dtype=int64)>

The words "This", "move" and "was" were found in the table, so their IDs are lower than 10,000,
while the word "faaaaaaaaaaaaaantastic" was not found, so it was mapped to one of the OOV buckets,
with an ID greater than or equal to 10,000.
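
The out-of-vocabulary mechanism can be sketched in a few lines of plain Python. This is an illustration only: the ```lookup``` helper and the toy vocabulary are hypothetical, and CRC32 merely stands in for whatever hash function TensorFlow uses internally, so the bucket IDs will not match those of ```StaticVocabularyTable```.

```python
import zlib

def lookup(word, vocab, num_oov_buckets):
    """Return the word's ID, or a hashed out-of-vocabulary bucket ID."""
    if word in vocab:
        return vocab[word]
    # Assumption: CRC32 is a stand-in for TensorFlow's internal hash function.
    bucket = zlib.crc32(word.encode("utf-8")) % num_oov_buckets
    return len(vocab) + bucket  # bucket IDs start right after the vocabulary

vocab = {"<pad>": 0, "the": 1, "was": 2}  # toy vocabulary
words = ["the", "was", "faaantastic"]
ids = [lookup(word, vocab, num_oov_buckets=5) for word in words]
print(ids)
```

An unknown word always lands in the same bucket, so the model can still learn something about it, although unrelated unknown words may collide in one bucket.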



In [142]:
# Encode words
def encode_words(X_batch, y_batch):
    return table.lookup(X_batch), y_batch

train_set = datasets["train"].batch(32).map(preprocess)
train_set = train_set.map(encode_words).prefetch(1)

In [146]:
# Model
embed_size = 128
model = keras.models.Sequential([
    keras.layers.Embedding(vocab_size + num_oov_buckets, embed_size, input_shape=[None]),
    keras.layers.GRU(128, return_sequences=True),
    keras.layers.GRU(128),
    keras.layers.Dense(1, activation="sigmoid")
])

model.compile(loss="binary_crossentropy", optimizer="Adam", metrics=["accuracy"])
history = model.fit(train_set, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


### Reusing Pretrained Embeddings
The TensorFlow Hub project makes it easy to reuse pretrained model components in your own models.
These model components are called *modules*. The ```hub.KerasLayer``` class downloads a module
along with its pretrained weights and includes it in your model.

This section will use TensorFlow Hub to download the ```nnlm-en-dim50``` sentence embedding module.

In [153]:
import tensorflow_hub as hub

model = keras.Sequential([
    hub.KerasLayer("https://tfhub.dev/google/tf2-preview/nnlm-en-dim50/1", dtype=tf.string,
                   input_shape=[], output_shape=[50]),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid")
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

In [154]:
datasets, info = tfds.load("imdb_reviews", as_supervised=True, with_info=True)
train_size = info.splits["train"].num_examples
batch_size = 32
train_set = datasets["train"].repeat().batch(batch_size).prefetch(1)
history = model.fit(train_set, steps_per_epoch=train_size // batch_size, epochs=5)

Train for 781 steps
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
