# **Udacity: Intro to TensorFlow for Deep Learning**
## **Lesson 10 NLP: Recurrent Neural Networks**

This lesson extends what was covered in lesson 9. It introduces recurrent neural networks, which are able to capture temporal dependencies (dependencies that change over time).

This lesson covers:
- Different RNNs: Simple RNN, LSTMs, GRUs
- Text generation using NLP models

<br>

I've split lesson 10 into 2 parts. This section covers text generation using NLP.

<br>

**Note**   
Part 1 of this lesson explored different recurrent neural networks and 1D CNNs that can be trained on embeddings from a sequence. I've gone through the basic variations of RNNs (SimpleRNN, LSTM and GRU).


## **Text Generation**

We can use the methods learnt so far, with some slight modifications, to generate new text. A summary of the text generation approach:   

**1. Prepare text: tokenize, sequence and create embedding**
  - Key difference: generate more samples from an individual sequence
  - From an individual sequence we can generate more sequences by increasing the length until we reach the maximum length of the sequence
  - The generated sequences are padded to have the same length
  - The last value in each generated sequence is used as the label
  - One-hot encode the labels

```python
  # initial sequence 
  [81, 82, 142, 197, 29, 4, 287, 197]

  # generated sequences
  [81, 82]
  [81, 82, 142]
  [81, 82, 142, 197]
  [81, 82, 142, 197, 29]
  [81, 82, 142, 197, 29, 4]
  [81, 82, 142, 197, 29, 4, 287]
  [81, 82, 142, 197, 29, 4, 287, 197]

  # padded sequences
  [0, 0, 0, 0, 0, 0, 81, 82]
  [0, 0, 0, 0, 0, 81, 82, 142]
  [0, 0, 0, 0, 81, 82, 142, 197]
  [0, 0, 0, 81, 82, 142, 197, 29]
  [0, 0, 81, 82, 142, 197, 29, 4]
  [0, 81, 82, 142, 197, 29, 4, 287]
  [81, 82, 142, 197, 29, 4, 287, 197]

  # Split into training and label
  [0, 0, 0, 0, 0, 0, 81],  [82]
  [0, 0, 0, 0, 0, 81, 82], [142]
  [0, 0, 0, 0, 81, 82, 142], [197]
  [0, 0, 0, 81, 82, 142, 197], [29]
  [0, 0, 81, 82, 142, 197, 29], [4]
  [0, 81, 82, 142, 197, 29, 4], [287]
  [81, 82, 142, 197, 29, 4, 287], [197]
```
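
The worked example above can be reproduced with a few lines of plain Python. This is a minimal sketch rather than the lesson's code: `make_training_pairs` is a hypothetical helper name, and it pads by hand instead of calling `pad_sequences`.

```python
def make_training_pairs(sequence, max_len):
    """Expand one token sequence into (features, label) pairs.

    Hypothetical helper for illustration: pads manually with zeros on the left.
    """
    pairs = []
    for end in range(2, len(sequence) + 1):
        prefix = sequence[:end]                           # growing prefix of the sequence
        padded = [0] * (max_len - len(prefix)) + prefix   # pre-pad to a fixed length
        pairs.append((padded[:-1], padded[-1]))           # last token becomes the label
    return pairs


initial_sequence = [81, 82, 142, 197, 29, 4, 287, 197]
pairs = make_training_pairs(initial_sequence, max_len=len(initial_sequence))

for features, label in pairs:
    print(features, [label])
```

Because the padding goes at the front, the label is always the final element, so a single slice separates features from labels.
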
**2. Define model architecture, loss, metrics and activation**
  - Define a model to perform multi-class classification
  - Use the required activation function (softmax) and set the number of possible classes in the last layer

**3. Train model**
- Train the final model on the padded generated sequences and one-hot encoded labels

<br>

**Notes**
- To create more variance in the generated text, rather than always selecting the most probable next word, the predicted probability distribution can be used to set the odds of selecting each word.

- After the model has been trained, we define a seed sentence from which to generate new text. The generated words are appended to the seed and fed back in recursively until (I assume) the length of the sentence matches the max length of the sequence.
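
The two notes above can be sketched together. The snippet below is an illustration under stated assumptions, not the course's implementation: `toy_next_word_probs` is a made-up stand-in for a trained model's softmax output, `generate` is a hypothetical helper, and the stopping rule is simply a fixed number of added words. It contrasts greedy selection with sampling from the distribution, and shows each chosen word being appended to the input before the next prediction.

```python
import random

VOCAB = ["<pad>", "the", "king", "is", "dead", "long", "live"]


def toy_next_word_probs(tokens):
    """Stand-in for a model prediction: one probability per vocabulary entry.

    Made-up rule for illustration: favour the index after the last token.
    """
    probs = [0.05] * len(VOCAB)
    probs[0] = 0.0                                  # never predict padding
    favoured = tokens[-1] % (len(VOCAB) - 1) + 1    # next index, skipping 0
    probs[favoured] += 1.0 - sum(probs)             # remaining mass goes to it
    return probs


def generate(seed_tokens, num_words, sample=True, rng=None):
    """Feed each predicted word back in until num_words have been added."""
    rng = rng or random.Random(0)
    tokens = list(seed_tokens)
    for _ in range(num_words):
        probs = toy_next_word_probs(tokens)
        if sample:
            # draw the next word in proportion to its predicted probability
            next_token = rng.choices(range(len(VOCAB)), weights=probs, k=1)[0]
        else:
            # greedy: always take the most probable word
            next_token = max(range(len(VOCAB)), key=probs.__getitem__)
        tokens.append(next_token)
    return tokens


greedy = generate([1, 2], num_words=4, sample=False)
sampled = generate([1, 2], num_words=4, sample=True)
print(" ".join(VOCAB[t] for t in greedy))
print(" ".join(VOCAB[t] for t in sampled))
```

Greedy decoding always returns the same continuation for a given seed, while sampling can wander onto less likely words, which is where the extra variance comes from.
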

## **Text Generation in Code**

### **Import dependencies**

Import dependencies needed for this notebook

In [None]:
import tensorflow as tf
import numpy as np

print(tf.__version__)

2.8.2


### **Get and prepare text for NLP model**

To do things differently, I've had a look at the catalog of text datasets provided in TensorFlow Datasets. The [tiny_shakespeare](https://www.tensorflow.org/datasets/catalog/tiny_shakespeare) dataset seems useful for this notebook.

**Get the training, validation and test text**

In [None]:
import tensorflow_datasets as tfds
print(tfds.__version__)

4.0.1


In [None]:
train_text, test_text, validation_text = tfds.load(name='tiny_shakespeare',
                                                   split=['train', 'test', 'validation'])

Downloading and preparing dataset tiny_shakespeare/1.0.0 (download: Unknown size, generated: 1.06 MiB, total: 1.06 MiB) to /root/tensorflow_datasets/tiny_shakespeare/1.0.0...


Dataset tiny_shakespeare downloaded and prepared to /root/tensorflow_datasets/tiny_shakespeare/1.0.0. Subsequent calls will reuse this data.


In [None]:
print(f"Number of training sample: {len(train_text)}")
print(f"Number of test sample: {len(test_text)}")
print(f"Number of validation sample: {len(validation_text)}")

Number of training sample: 1
Number of test sample: 1
Number of validation sample: 1


**Prepare the training text data**

In [None]:
# view the contents of the training text
# need to split the string such that we have multiple lines of text
for string in train_text:
  byte_string = string['text'].numpy()

print(byte_string)



In [None]:
lines_from_shakespeare = []

for string in tf.strings.split(byte_string, sep='\n'):
  decoded_string = string.numpy().decode('UTF-8')
  if ":" not in decoded_string and (decoded_string != ""):
    lines_from_shakespeare.append(decoded_string)

print(lines_from_shakespeare)
print(f"Number of lines: {len(lines_from_shakespeare)}")

Number of lines: 20235


**Initial set of preprocessing**
- remove the character name
- get individual lines of dialog

In [None]:
# define a function to encapsulate the above process
def extract_text(dataset):
  lines_from_shakespeare = []

  # extract the byte string
  for string in dataset:
    byte_string = string['text'].numpy()

  # split and filter the string
  for string in tf.strings.split(byte_string, sep='\n'):
    decoded_string = string.numpy().decode('UTF-8')
    if ":" not in decoded_string and (decoded_string != ""):
      lines_from_shakespeare.append(decoded_string)
  
  return lines_from_shakespeare
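
To see what the filter keeps and drops, here is a small plain-Python sketch. The sample text is a short excerpt laid out in the speaker-then-dialog format, used here purely for illustration, and `dialog_with_colon` is a made-up line.

```python
sample_text = (
    "First Citizen:\n"
    "Before we proceed any further, hear me speak.\n"
    "\n"
    "All:\n"
    "Speak, speak.\n"
)

# same rule as above: keep non-empty lines that contain no colon
kept = [line for line in sample_text.split("\n") if ":" not in line and line != ""]
print(kept)

# caveat: a line of dialog that itself contains a colon is dropped as well
dialog_with_colon = "Mark this: a made-up line of dialog"
print(":" not in dialog_with_colon)
```

The rule is simple but blunt: speaker names are removed, and so is any spoken line that happens to contain a colon.
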

**Define a tokenizer and fit it on the dialog**

In [None]:
# define a tokenizer and fit it to the lines of dialog
shakespeare_tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=1000, oov_token="<OOV>")
shakespeare_tokenizer.fit_on_texts(lines_from_shakespeare)


In [None]:
print(shakespeare_tokenizer.word_index)



**Convert the text into sequences**

In [None]:
# convert the lines of dialog into sequences
training_sequences = shakespeare_tokenizer.texts_to_sequences(lines_from_shakespeare)

print(training_sequences)
print(len(training_sequences))

[[135, 33, 914, 130, 523, 119, 18, 103], [103, 103], [8, 41, 35, 1, 307, 4, 175, 62, 4, 1], [1, 1], [211, 8, 92, 915, 270, 12, 1, 495, 4, 2, 239], [33, 1, 33, 1], [71, 77, 395, 27, 3, 331, 25, 1, 53, 37, 144, 1], [569, 9, 1], [68, 209, 47, 834], [33, 41, 1, 152, 834, 2, 1, 47], [54, 454, 77, 22, 2, 1, 332, 17, 72], [1, 33, 219, 835, 58, 1, 77, 1], [1, 77, 2, 1, 6, 37, 1, 12, 24, 83], [1, 4, 1, 60, 1, 37], [1, 12, 9, 1, 4, 64, 71, 77, 468, 21, 15], [103, 21, 11, 1, 14, 1, 13, 11, 1, 14, 468], [54, 8, 914, 1, 147, 915, 270], [1, 8, 28, 1, 23, 279, 156, 14, 20, 553], [149, 65, 3, 180, 19, 535, 4, 104, 27, 47], [524, 1, 22, 10, 23, 1, 212, 15, 131, 371], [190, 22, 103, 13, 1], [5, 70, 199, 8, 28, 23, 66, 156, 1, 23, 91], [535, 4, 70, 17, 57, 14, 20, 553, 23, 91, 17, 4], [275, 20, 188, 3, 4, 19, 1, 371, 48, 23], [12, 167, 163, 2, 1, 6, 20, 570], [28, 23, 132, 246, 11, 20, 343, 8, 916, 9], [1, 11, 27, 8, 85, 11, 36, 169, 70, 23, 12, 1], [39, 5, 85, 13, 5, 433, 13, 19, 1, 6, 1], [23, 66, 622,

In [11]:
training_sequences[0]

[135, 33, 914, 130, 523, 119, 18, 103]
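
The many `1`s in the sequences above are the `<OOV>` index: with `num_words=1000`, any word ranked outside the most frequent entries is mapped to it. The sketch below imitates that behaviour in plain Python. It is a simplified illustration, not the Keras implementation (for example, it does not strip punctuation), and `build_word_index` and `to_sequence` are hypothetical helper names.

```python
from collections import Counter


def build_word_index(lines, oov_token="<OOV>"):
    """Rank words by frequency; index 1 is reserved for the OOV token."""
    counts = Counter(word for line in lines for word in line.lower().split())
    ranked = [oov_token] + [word for word, _ in counts.most_common()]
    return {word: index for index, word in enumerate(ranked, start=1)}


def to_sequence(line, word_index, num_words, oov_token="<OOV>"):
    """Map words to indices; unknown or too-rare words become the OOV index."""
    oov_index = word_index[oov_token]
    sequence = []
    for word in line.lower().split():
        index = word_index.get(word)
        if index is None or index >= num_words:
            index = oov_index
        sequence.append(index)
    return sequence


lines = ["speak speak", "hear me speak", "hear the king"]
word_index = build_word_index(lines)
print(word_index)
print(to_sequence("hear the queen speak", word_index, num_words=4))
```

Here "the" is in the vocabulary but ranked too low for `num_words=4`, and "queen" was never seen, so both come out as index 1.
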

**Create more sequences from the initial set of sequences**

In [12]:
# from each sequence generate n-grams of the sequence
# Admittedly I don't know how to generate more sequences from the initial set, so I've had a look at the lesson code

n_gram_training_sequence = []

for sequence in training_sequences:
  for i in range(1, len(sequence)):
    n_gram = sequence[:i+1]
    n_gram_training_sequence.append(n_gram)

print(len(n_gram_training_sequence))


132089


Yikes, that's ~130k training samples.

In [13]:
# display some of the sequences
print(n_gram_training_sequence[:20])

[[135, 33], [135, 33, 914], [135, 33, 914, 130], [135, 33, 914, 130, 523], [135, 33, 914, 130, 523, 119], [135, 33, 914, 130, 523, 119, 18], [135, 33, 914, 130, 523, 119, 18, 103], [103, 103], [8, 41], [8, 41, 35], [8, 41, 35, 1], [8, 41, 35, 1, 307], [8, 41, 35, 1, 307, 4], [8, 41, 35, 1, 307, 4, 175], [8, 41, 35, 1, 307, 4, 175, 62], [8, 41, 35, 1, 307, 4, 175, 62, 4], [8, 41, 35, 1, 307, 4, 175, 62, 4, 1], [1, 1], [211, 8], [211, 8, 92]]
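
The sample count grows quickly because a sequence of n tokens contributes n - 1 prefixes of length two or more. A small sketch of that arithmetic, using the first four sequences printed earlier (`count_prefixes` is a hypothetical helper name):

```python
def count_prefixes(sequences):
    """Number of prefixes of length >= 2 across all sequences."""
    return sum(max(len(sequence) - 1, 0) for sequence in sequences)


# first four sequences from the tokenized training data
first_sequences = [
    [135, 33, 914, 130, 523, 119, 18, 103],
    [103, 103],
    [8, 41, 35, 1, 307, 4, 175, 62, 4, 1],
    [1, 1],
]
print(count_prefixes(first_sequences))
```
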


**Apply padding to the sequences**

In [14]:
# Apply padding to the sequences and then split it into a feature and label
max_sequence = max([len(sequence) for sequence in n_gram_training_sequence])
print(f"Length of the longest sequence is {max_sequence}")



Length of the longest sequence is 15


In [15]:
# Apply padding
from tensorflow.keras.preprocessing.sequence import pad_sequences

padded_sequence = pad_sequences(n_gram_training_sequence, maxlen=15,
                                padding='post', truncating='post')


In [17]:
# View the padded sequences
print(padded_sequence[:20])

[[135  33   0   0   0   0   0   0   0   0   0   0   0   0   0]
 [135  33 914   0   0   0   0   0   0   0   0   0   0   0   0]
 [135  33 914 130   0   0   0   0   0   0   0   0   0   0   0]
 [135  33 914 130 523   0   0   0   0   0   0   0   0   0   0]
 [135  33 914 130 523 119   0   0   0   0   0   0   0   0   0]
 [135  33 914 130 523 119  18   0   0   0   0   0   0   0   0]
 [135  33 914 130 523 119  18 103   0   0   0   0   0   0   0]
 [103 103   0   0   0   0   0   0   0   0   0   0   0   0   0]
 [  8  41   0   0   0   0   0   0   0   0   0   0   0   0   0]
 [  8  41  35   0   0   0   0   0   0   0   0   0   0   0   0]
 [  8  41  35   1   0   0   0   0   0   0   0   0   0   0   0]
 [  8  41  35   1 307   0   0   0   0   0   0   0   0   0   0]
 [  8  41  35   1 307   4   0   0   0   0   0   0   0   0   0]
 [  8  41  35   1 307   4 175   0   0   0   0   0   0   0   0]
 [  8  41  35   1 307   4 175  62   0   0   0   0   0   0   0]
 [  8  41  35   1 307   4 175  62   4   0   0   0   0  

I've left the above 2 cells in to prompt some thought on how to split the data into features and labels.

With padding and truncating applied at the end of the sequence, what is the best way to split the data? At the moment I'm not sure. *I might revisit this later*, but it raises the question of whether we can only do text generation with pre-padding and truncating.
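
One possible way around this, sketched here as an assumption rather than something taken from the lesson: separate the label from each n-gram before padding, so the padding side no longer matters for the split. `split_then_pad` is a hypothetical helper that pads by hand instead of calling `pad_sequences`.

```python
def split_then_pad(n_gram, feature_len, padding="post"):
    """Separate the label first, then pad only the features."""
    features, label = n_gram[:-1], n_gram[-1]
    features = features[-feature_len:]            # keep the most recent tokens if too long
    fill = [0] * (feature_len - len(features))
    padded = features + fill if padding == "post" else fill + features
    return padded, label


features, label = split_then_pad([135, 33, 914], feature_len=14, padding="post")
print(features, label)
```

Whether a recurrent model learns as well from post-padded inputs is a separate question that this sketch does not answer.
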

**Split the padded sequences into features and labels**

In [18]:
# Apply pre padding
padded_sequence = pad_sequences(n_gram_training_sequence, maxlen=15,
                                padding='pre', truncating='post') # remove from the end of line when truncating.

# View the padded sequences
print(padded_sequence[:20])

[[  0   0   0   0   0   0   0   0   0   0   0   0   0 135  33]
 [  0   0   0   0   0   0   0   0   0   0   0   0 135  33 914]
 [  0   0   0   0   0   0   0   0   0   0   0 135  33 914 130]
 [  0   0   0   0   0   0   0   0   0   0 135  33 914 130 523]
 [  0   0   0   0   0   0   0   0   0 135  33 914 130 523 119]
 [  0   0   0   0   0   0   0   0 135  33 914 130 523 119  18]
 [  0   0   0   0   0   0   0 135  33 914 130 523 119  18 103]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0 103 103]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   8  41]
 [  0   0   0   0   0   0   0   0   0   0   0   0   8  41  35]
 [  0   0   0   0   0   0   0   0   0   0   0   8  41  35   1]
 [  0   0   0   0   0   0   0   0   0   0   8  41  35   1 307]
 [  0   0   0   0   0   0   0   0   0   8  41  35   1 307   4]
 [  0   0   0   0   0   0   0   0   8  41  35   1 307   4 175]
 [  0   0   0   0   0   0   0   8  41  35   1 307   4 175  62]
 [  0   0   0   0   0   0   8  41  35   1 307   4 175  

In [19]:
# Split the sequence into features and label
training_sequences_features = padded_sequence[:,:-1]
print(training_sequences_features[:20])

[[  0   0   0   0   0   0   0   0   0   0   0   0   0 135]
 [  0   0   0   0   0   0   0   0   0   0   0   0 135  33]
 [  0   0   0   0   0   0   0   0   0   0   0 135  33 914]
 [  0   0   0   0   0   0   0   0   0   0 135  33 914 130]
 [  0   0   0   0   0   0   0   0   0 135  33 914 130 523]
 [  0   0   0   0   0   0   0   0 135  33 914 130 523 119]
 [  0   0   0   0   0   0   0 135  33 914 130 523 119  18]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0 103]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   8]
 [  0   0   0   0   0   0   0   0   0   0   0   0   8  41]
 [  0   0   0   0   0   0   0   0   0   0   0   8  41  35]
 [  0   0   0   0   0   0   0   0   0   0   8  41  35   1]
 [  0   0   0   0   0   0   0   0   0   8  41  35   1 307]
 [  0   0   0   0   0   0   0   0   8  41  35   1 307   4]
 [  0   0   0   0   0   0   0   8  41  35   1 307   4 175]
 [  0   0   0   0   0   0   8  41  35   1 307   4 175  62]
 [  0   0   0   0   0   8  41  35   1 307   4 175  62   

In [21]:
# get the label
training_sequences_label = padded_sequence[:,-1]
print(training_sequences_label[:20])

[ 33 914 130 523 119  18 103 103  41  35   1 307   4 175  62   4   1   1
   8  92]


**Create a one-hot encoding for each label**

In [22]:
# create a one-hot encoding of each label
training_sequences_label_encoded = tf.keras.utils.to_categorical(training_sequences_label, num_classes=1000)

print(training_sequences_label_encoded[:20])

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 1. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
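
The truncated printout hides most of the ones. The row beginning `[0. 1. 0.` is a label of 1, which is the `<OOV>` index. As a plain-Python illustration of what each row contains (`one_hot` is a hypothetical helper, not the Keras function):

```python
def one_hot(label, num_classes):
    """Return a list of zeros with a single 1.0 at position `label`."""
    vector = [0.0] * num_classes
    vector[label] = 1.0
    return vector


encoded = one_hot(33, num_classes=1000)   # 33 is the first label printed earlier
print(len(encoded), encoded.index(1.0), sum(encoded))
```
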


Summary of steps taken so far:   
1. Get the data
2. Perform an initial set of cleaning on the data
3. Perform tokenization on the text data
4. Create sequences from the text data
  - From each sequence generate more sequences
  - Pad the combined sequences
5. Split the sequences into features and labels, with the label being the last token in the sequence
6. Encode the labels using one-hot vectors

I wonder, is this the accepted approach for text generation?

**Repeat the above steps on the validation and test data**

In [23]:
# Apply whole data preparation process to the test and validation data.
# Note use the tokenizer that has been fitted on the training data

# copy of the initial function to pre-process the text dataset
def extract_text(dataset):
  lines_from_shakespeare = []

  # extract the byte string
  for string in dataset:
    byte_string = string['text'].numpy()

  # split and filter the string
  for string in tf.strings.split(byte_string, sep='\n'):
    decoded_string = string.numpy().decode('UTF-8')
    if ":" not in decoded_string and (decoded_string != ""):
      lines_from_shakespeare.append(decoded_string)
  
  return lines_from_shakespeare


def tokenize_pad_encode_text(lines_from_shakespeare):
  # convert the text into sequences
  sequences = shakespeare_tokenizer.texts_to_sequences(lines_from_shakespeare)

  # we are not training the model on the validation and test data, so there is no need
  # to generate more sequences from the initial set. We could generate more sequences
  # but this should suffice for now.

  # Apply padding to the sequences
  padded_sequences = pad_sequences(sequences, maxlen=15,
                                padding='pre', truncating='post')
  
  # convert the sequences into features and labels
  padded_sequences_features = padded_sequences[:,:-1]
  padded_sequences_labels = padded_sequences[:,-1]

  # one-hot encode the labels
  padded_sequences_one_hot_encoded_labels = tf.keras.utils.to_categorical(padded_sequences_labels,
                                                                          num_classes=1000)
  return padded_sequences_features, padded_sequences_one_hot_encoded_labels



In [None]:
# prepare the validation and test data
# - test_text, validation_text
validation_lines_from_shakespeare = extract_text(validation_text)
validation_features, validation_labels = tokenize_pad_encode_text(validation_lines_from_shakespeare)


In [None]:
test_lines_from_shakespeare = extract_text(test_text)
test_features, test_labels = tokenize_pad_encode_text(test_lines_from_shakespeare)

### **Define Model**

For bants, I'd like to create different types of model with varying architectures and see what the results are like.

In [None]:
# Model training parameters
max_sequence_length = 14  # padded sequences have 15 tokens; the last one is split off as the label
vocabulary_size = 1000
embedding_dim = 17

#### **Define and compile a model with LSTM layers**

In [None]:
# define the model
Shakespeare_LSTM = tf.keras.Sequential([
                                        tf.keras.layers.Embedding(input_dim = vocabulary_size, output_dim= embedding_dim, input_length= max_sequence_length),
                                        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(13, return_sequences=True, recurrent_dropout=0.5, dropout=0.5)),
                                        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(21, return_sequences=True, recurrent_dropout=0.5, dropout=0.5)),
                                        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(34, return_sequences=False, recurrent_dropout=0.5, dropout=0.5)),
                                        tf.keras.layers.Dense(100, activation=tf.keras.activations.relu),
                                        tf.keras.layers.Dense(55, activation=tf.keras.activations.relu),
                                        tf.keras.layers.Dense(vocabulary_size, activation=tf.keras.activations.softmax)
])


In [None]:
# compile the model
Shakespeare_LSTM.compile(loss="categorical_crossentropy",
                         optimizer="adam",
                         metrics=["accuracy"])
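
As a rough sanity check on model size, the parameter counts can be worked out by hand and compared against `Shakespeare_LSTM.summary()`. The sketch below uses the standard LSTM parameter formula and assumes the default concatenating behaviour of `Bidirectional`. The helper names and dictionary keys are made up for this illustration.

```python
def lstm_params(input_dim, units):
    """Weights of one LSTM layer: 4 gates, each with input, recurrent and bias terms."""
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    """Weights of a fully connected layer: kernel plus bias."""
    return input_dim * units + units


vocabulary_size, embedding_dim = 1000, 17

layers = {
    "embedding": vocabulary_size * embedding_dim,
    "bi_lstm_13": 2 * lstm_params(embedding_dim, 13),   # Bidirectional doubles the layer
    "bi_lstm_21": 2 * lstm_params(2 * 13, 21),          # input is both directions concatenated
    "bi_lstm_34": 2 * lstm_params(2 * 21, 34),
    "dense_100": dense_params(2 * 34, 100),
    "dense_55": dense_params(100, 55),
    "dense_1000": dense_params(55, vocabulary_size),
}

for name, count in layers.items():
    print(f"{name}: {count}")
print("total:", sum(layers.values()))
```

Most of the weights sit in the final softmax layer and the embedding, not in the recurrent layers. Dropout settings do not add any parameters.
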

#### **Define and compile a model with GRU layers**

In [None]:
# define the model
Shakespeare_GRU = tf.keras.Sequential([
                                        tf.keras.layers.Embedding(input_dim = vocabulary_size, output_dim= embedding_dim, input_length= max_sequence_length),
                                        tf.keras.layers.Bidirectional(tf.keras.layers.GRU(13, return_sequences=True, recurrent_dropout=0.5, dropout=0.5)),
                                        tf.keras.layers.Bidirectional(tf.keras.layers.GRU(21, return_sequences=True, recurrent_dropout=0.5, dropout=0.5)),
                                        tf.keras.layers.Bidirectional(tf.keras.layers.GRU(34, return_sequences=False, recurrent_dropout=0.5, dropout=0.5)),
                                        tf.keras.layers.Dense(100, activation=tf.keras.activations.relu),
                                        tf.keras.layers.Dense(55, activation=tf.keras.activations.relu),
                                        tf.keras.layers.Dense(vocabulary_size, activation=tf.keras.activations.softmax)
])

In [None]:
# compile the model
Shakespeare_GRU.compile(loss="categorical_crossentropy",
                         optimizer="adam",
                         metrics=["accuracy"])

#### **Define and compile a model with SimpleRNN layers**

In [None]:
# define the model
Shakespeare_SimpleRNN = tf.keras.Sequential([
                                        tf.keras.layers.Embedding(input_dim = vocabulary_size, output_dim= embedding_dim, input_length= max_sequence_length),
                                        tf.keras.layers.Bidirectional(tf.keras.layers.SimpleRNN(13, return_sequences=True, recurrent_dropout=0.5, dropout=0.5)),
                                        tf.keras.layers.Bidirectional(tf.keras.layers.SimpleRNN(21, return_sequences=True, recurrent_dropout=0.5, dropout=0.5)),
                                        tf.keras.layers.Bidirectional(tf.keras.layers.SimpleRNN(34, return_sequences=False, recurrent_dropout=0.5, dropout=0.5)),
                                        tf.keras.layers.Dense(100, activation=tf.keras.activations.relu),
                                        tf.keras.layers.Dense(55, activation=tf.keras.activations.relu),
                                        tf.keras.layers.Dense(vocabulary_size, activation=tf.keras.activations.softmax)
])

In [None]:
# compile the model
Shakespeare_SimpleRNN.compile(loss="categorical_crossentropy",
                         optimizer="adam",
                         metrics=["accuracy"])


#### **Define and compile a model with a combination of the above 3 layers**

In [None]:
# define the model
Shakespeare_Hybrid = tf.keras.Sequential([
                                        tf.keras.layers.Embedding(input_dim = vocabulary_size, output_dim= embedding_dim, input_length= max_sequence_length),
                                        tf.keras.layers.Bidirectional(tf.keras.layers.SimpleRNN(13, return_sequences=True, recurrent_dropout=0.5, dropout=0.5)),
                                        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(21, return_sequences=True, recurrent_dropout=0.5, dropout=0.5)),
                                        tf.keras.layers.Bidirectional(tf.keras.layers.GRU(34, return_sequences=False, recurrent_dropout=0.5, dropout=0.5)),
                                        tf.keras.layers.Dense(100, activation=tf.keras.activations.relu),
                                        tf.keras.layers.Dense(55, activation=tf.keras.activations.relu),
                                        tf.keras.layers.Dense(vocabulary_size, activation=tf.keras.activations.softmax)
])

In [None]:
# compile the model
Shakespeare_Hybrid.compile(loss="categorical_crossentropy",
                         optimizer="adam",
                         metrics=["accuracy"])


### **Train Model**