# <span style="color:#0b486b">  FIT3181: Deep Learning (2022)</span>
***
*CE/Lecturer (Clayton):*  **Dr Trung Le** | trunglm@monash.edu <br/>
*Lecturer (Malaysia):*  **Dr Lim Chern Hong** | lim.chernhong@monash.edu <br/>  <br/>
*Tutor:*  **Mr Thanh Nguyen** \[Thanh.Nguyen4@monash.edu \] |**Mr Tuan Nguyen**  \[tuan.ng@monash.edu \] |**Mr Anh Bui** \[tuananh.bui@monash.edu\] | **Dr Binh Nguyen** \[binh.nguyen1@monash.edu \] | **Mr Md Mohaimenuzzaman** \[md.mohaimen@monash.edu \] |**Mr James Tong** \[james.tong1@monash.edu \]
<br/> <br/>
Faculty of Information Technology, Monash University, Australia
******

# <span style="color:#0b486b">Tutorial 08c (Additional Reading): RNN for Text Generation</span> <span style="color:red">***</span> #

This tutorial shows one application of RNNs: generating text or other sequences. We train an RNN using the maximum likelihood principle and then use the trained RNN to generate text that imitates the existing texts in the dataset it was trained on.
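Concretely, for a character sequence $x_1, \dots, x_T$ and model parameters $\theta$ (notation introduced here for illustration only), the maximum likelihood objective for next-character prediction can be written as:

$$
\max_{\theta} \; \sum_{t=1}^{T-1} \log p_{\theta}\left(x_{t+1} \mid x_1, \dots, x_t\right)
$$

Minimizing the cross-entropy between the predicted next-character distribution and the actual next character at every time step is equivalent to maximizing this sum.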

We first import the necessary modules.

## <span style="color:#0b486b">I. Download and preprocess data</span> ##

In [1]:
import os
import numpy as np
import re
import shutil
import tensorflow as tf

In [2]:
DATA_DIR = "."
CHECKPOINT_DIR = os.path.join(DATA_DIR, "checkpoints")
if not os.path.exists(CHECKPOINT_DIR):
    os.mkdir(CHECKPOINT_DIR)

The function below downloads the text at each given URL, normalizes its whitespace, and returns all the texts as one flat list of characters.

In [3]:
def download_and_read(urls):
    texts = []
    for i, url in enumerate(urls):
        p = tf.keras.utils.get_file("ex1-{:d}.txt".format(i), url, cache_dir=".")
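        # read the whole file; the context manager closes it afterwards
        with open(p, "r", encoding="utf8") as f:
            text = f.read()
        # remove byte order mark
        text = text.replace("\ufeff", "")
        # replace newlines and collapse whitespace runs into single spaces
        text = text.replace('\n', ' ')
        text = re.sub(r'\s+', " ", text)
        # extending a list with a string appends its characters one by one
        texts.extend(text)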
    return texts
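To see why the function returns a flat list of characters rather than a list of documents, note that `list.extend` iterates over its argument, and iterating over a string yields single characters. The following is a minimal offline sketch of the same cleaning steps; the string `raw` is a made-up sample standing in for a downloaded file, not part of the dataset.

```python
import re

# Hypothetical sample standing in for a downloaded file (not from the dataset).
raw = "\ufeffAlice was\nbeginning   to get\n\nvery tired"

text = raw.replace("\ufeff", "")   # remove the byte order mark
text = text.replace("\n", " ")     # newlines become spaces
text = re.sub(r"\s+", " ", text)   # collapse whitespace runs into one space

chars = []
chars.extend(text)                 # extend iterates the string character by character

print(text)
print(chars[:5])
```

The list ends up with one entry per character, which is the form the vocabulary step expects.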

We download the two texts; the variable *texts* is then a single list containing every character of both.

In [4]:
texts = download_and_read(["http://www.gutenberg.org/cache/epub/28885/pg28885.txt", "https://www.gutenberg.org/files/12/12-0.txt"])

In [5]:
print(texts[0:100])

['P', 'r', 'o', 'j', 'e', 'c', 't', ' ', 'G', 'u', 't', 'e', 'n', 'b', 'e', 'r', 'g', "'", 's', ' ', 'A', 'l', 'i', 'c', 'e', "'", 's', ' ', 'A', 'd', 'v', 'e', 'n', 't', 'u', 'r', 'e', 's', ' ', 'i', 'n', ' ', 'W', 'o', 'n', 'd', 'e', 'r', 'l', 'a', 'n', 'd', ',', ' ', 'b', 'y', ' ', 'L', 'e', 'w', 'i', 's', ' ', 'C', 'a', 'r', 'r', 'o', 'l', 'l', ' ', 'T', 'h', 'i', 's', ' ', 'e', 'B', 'o', 'o', 'k', ' ', 'i', 's', ' ', 'f', 'o', 'r', ' ', 't', 'h', 'e', ' ', 'u', 's', 'e', ' ', 'o', 'f', ' ']


We extract the vocabulary of all unique characters in this dataset and store it in *vocab*. In addition, we build two dictionaries, *char2idx* and *idx2char*, to convert between characters and their indices.

In [6]:
# create the vocabulary
vocab = sorted(set(texts))
print("vocab size: {:d}".format(len(vocab)))
# create mapping from vocab chars to ints
char2idx = {c:i for i, c in enumerate(vocab)}
idx2char = {i:c for c, i in char2idx.items()}

vocab size: 90
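As a quick check of the two mappings, the sketch below builds them for a toy corpus and verifies that encoding followed by decoding returns the original string. The names prefixed with `toy_` and the sample string are illustrative assumptions, not part of the tutorial's data.

```python
# Hypothetical toy corpus; the tutorial's real corpus is the downloaded text.
sample = list("hello world")

toy_vocab = sorted(set(sample))                        # unique characters, sorted
toy_char2idx = {c: i for i, c in enumerate(toy_vocab)}
toy_idx2char = {i: c for c, i in toy_char2idx.items()}

encoded = [toy_char2idx[c] for c in sample]            # characters -> indices
decoded = "".join(toy_idx2char[i] for i in encoded)    # indices -> characters

print(toy_vocab)
print(encoded)
print(decoded)
```

Sorting the vocabulary makes the index assignment deterministic, so a saved model can be reused as long as the corpus yields the same character set.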


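We convert the characters in *texts* to their indices, stored in *texts_as_ints*, and then create a TensorFlow dataset *data* from *texts_as_ints*. Finally, we chop *data* into consecutive sequences of *seq_length + 1* characters, stored in *sequences*; the extra character is needed because the input and the label are the same sequence offset by one position.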

In [7]:
# numericize the texts
texts_as_ints = np.array([char2idx[c] for c in texts])
data = tf.data.Dataset.from_tensor_slices(texts_as_ints)
# number of characters to show before asking for prediction
# each element of sequences has length seq_length + 1 = 101
seq_length = 100
sequences = data.batch(seq_length + 1, drop_remainder=True)
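The effect of `batch(seq_length + 1, drop_remainder=True)` on a stream of indices can be imitated in plain Python. This is only a sketch of the chunking behavior, not how `tf.data` is implemented; `chunk` and the `toy_` names are hypothetical helpers.

```python
def chunk(ids, size):
    """Split ids into consecutive chunks of exactly `size` items, dropping any remainder."""
    n_full = len(ids) // size
    return [ids[k * size:(k + 1) * size] for k in range(n_full)]

toy_ids = list(range(10))   # stands in for texts_as_ints
toy_seq_length = 3          # stands in for seq_length

toy_sequences = chunk(toy_ids, toy_seq_length + 1)
print(toy_sequences)        # two chunks of 4; the trailing items 8 and 9 are dropped
```

Dropping the remainder keeps every sequence the same length, which lets them be stacked into fixed-shape batches later.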

In the function below, *sequence* is one sequence of characters. For example, given \['I', 'l', 'o', 'v', 'e', 'D', 'L'\], the function returns \['I', 'l', 'o', 'v', 'e', 'D'\] and \['l', 'o', 'v', 'e', 'D', 'L'\].

The idea is that we later feed \['I', 'l', 'o', 'v', 'e', 'D'\] to our RNN and train it to predict \['l', 'o', 'v', 'e', 'D', 'L'\], that is, the next character at every position.

In [8]:
def split_train_labels(sequence):
    input_seq = sequence[0:-1]
    output_seq = sequence[1:]
    return input_seq, output_seq
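Because the function only uses slicing, it can be tried on an ordinary Python list. The sketch below repeats the definition so that it runs on its own and checks that each label is the input character one step ahead.

```python
def split_train_labels(sequence):
    input_seq = sequence[0:-1]   # everything except the last item
    output_seq = sequence[1:]    # everything except the first item
    return input_seq, output_seq

x, y = split_train_labels(list("IloveDL"))
print(x)
print(y)

# The label at position t is the input at position t + 1.
aligned = all(y[t] == x[t + 1] for t in range(len(x) - 1))
print(aligned)
```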

We now apply the function *split_train_labels* to each sequence in *sequences*, then shuffle the resulting (input, label) pairs and group them into batches.

In [9]:
sequences = sequences.map(split_train_labels)
# set up for training
# each element of dataset is a pair (inputs, labels), both of shape [64, 100]
batch_size = 64
steps_per_epoch = len(texts) // seq_length // batch_size
dataset = sequences.shuffle(10000).batch(batch_size, drop_remainder=True)
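The arithmetic behind `steps_per_epoch` can be checked by hand. The sketch below uses a hypothetical corpus size `n_chars` (the real value is `len(texts)`) and compares the formula used above with the number of full batches the pipeline actually yields in one pass.

```python
n_chars = 500_000      # hypothetical corpus size, for illustration only
seq_length = 100
batch_size = 64

# Sequences hold seq_length + 1 characters, and incomplete ones are dropped.
n_sequences = n_chars // (seq_length + 1)
# Incomplete batches are dropped as well.
full_batches = n_sequences // batch_size

# The formula used in the cell above divides by seq_length instead.
steps_formula = n_chars // seq_length // batch_size

print(n_sequences, full_batches, steps_formula)
```

For this hypothetical size the formula gives one step more than a single pass provides. That is harmless when training on `dataset.repeat()`, as the training loop in this tutorial does, since the repeated dataset never runs out of batches.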

We encapsulate our generation model in the class *CharGenModel*. The model has an embedding layer, one hidden layer of GRU cells, and a dense output layer that produces one logit per vocabulary character. Note that we set *return_sequences=True* on the GRU layer so that it returns a 3D tensor containing the hidden state at every time step, not only the last one.

In [10]:
class CharGenModel(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, **kwargs):
        super(CharGenModel, self).__init__(**kwargs)
        self.embedding_layer = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.rnn_layer = tf.keras.layers.GRU(embedding_dim, recurrent_initializer="glorot_uniform", recurrent_activation="sigmoid", 
                                             stateful=True, return_sequences=True)
        self.dense_layer = tf.keras.layers.Dense(vocab_size)
    
    def call(self, x):
        x = self.embedding_layer(x)
        x = self.rnn_layer(x)
        x = self.dense_layer(x)
        return x

We build the model.

In [11]:
vocab_size = len(vocab)
embedding_dim = 256
rnn_output_dim = 1024  # not used below: CharGenModel sizes its GRU with embedding_dim

In [12]:
model = CharGenModel(vocab_size, embedding_dim)
model.build(input_shape=(batch_size, None))
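As a sanity check on the model size, the trainable parameters can be counted by hand. The sketch below assumes the TensorFlow 2 default GRU variant (`reset_after=True`), which keeps two bias vectors per gate; under that assumption the total should agree with what `model.summary()` reports. The helper `gru_params` is a hypothetical name.

```python
def gru_params(input_dim, units):
    # Three weight groups (update gate, reset gate, candidate state), each with an
    # input kernel, a recurrent kernel and two bias vectors (assumes reset_after=True).
    return 3 * (input_dim * units + units * units + 2 * units)

vocab_size = 90        # printed earlier as "vocab size: 90"
embedding_dim = 256
gru_units = embedding_dim   # the class passes embedding_dim as the GRU width

embedding = vocab_size * embedding_dim            # one vector per character
gru = gru_params(embedding_dim, gru_units)
dense = gru_units * vocab_size + vocab_size       # weights plus biases

total = embedding + gru + dense
print(embedding, gru, dense, total)
```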

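We define the loss function, which computes the sparse categorical cross-entropy between the true next character and the predicted logits at each time step. Keras then averages these per-step losses when reporting the training loss.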

In [13]:
def loss(labels, predictions):
    return tf.losses.sparse_categorical_crossentropy(labels,predictions,from_logits=True)
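To make the per-step loss concrete, here is a small pure-Python sketch of cross-entropy computed directly from logits, which is what `from_logits=True` requests. The helper `sparse_ce_from_logits` and the toy numbers are illustrative assumptions, not the TensorFlow implementation.

```python
import math

def sparse_ce_from_logits(label, logits):
    """Return -log softmax(logits)[label], using log-sum-exp for numerical stability."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

# One toy sequence with three time steps over a three-character vocabulary.
logits_per_step = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
labels = [0, 2, 1]

step_losses = [sparse_ce_from_logits(y, z) for y, z in zip(labels, logits_per_step)]
mean_loss = sum(step_losses) / len(step_losses)

print(step_losses)
print(mean_loss)
```

The middle step has uniform logits, so its loss equals $\log 3$: the model is no better than guessing among three characters. The other two steps put the largest logit on the correct label and therefore incur a smaller loss.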

In [14]:
model.compile(optimizer=tf.optimizers.Adam(), loss=loss)

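To generate text, we start from a prefix string. We convert it to a list of character indices and add a leading batch dimension of size $1$ to obtain the 2D tensor *inputs*. Feeding *inputs* to the model yields the logits *preds* (unnormalized scores over the vocabulary). We divide them by *temperature*, sample the next character index *pred_id* from the resulting distribution at the last time step, and feed *pred_id* back as the next input. Because the GRU layer is stateful, it carries its hidden state from one call to the next, so only the newest character needs to be fed in. We repeat this until *num_chars_to_generate* characters have been produced.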

In [15]:
def generate_text(model, prefix_string, char2idx, idx2char, num_chars_to_generate=1000, temperature=1.0):
    inputs = [char2idx[s] for s in prefix_string]
    inputs = tf.expand_dims(inputs, 0)
    text_generated = []
    model.reset_states()
    for i in range(num_chars_to_generate):
        preds = model(inputs)
        preds = tf.squeeze(preds, 0) / temperature
        # sample the next character id from the logits at the last time step
        pred_id = tf.random.categorical(preds, num_samples=1)[-1, 0].numpy()
        text_generated.append(idx2char[pred_id])
        # pass the prediction as the next input to the model
        inputs = tf.expand_dims([pred_id], 0)
    return prefix_string + "".join(text_generated)
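The role of `temperature` is easiest to see on a small example. The sketch below is a pure-Python stand-in for dividing logits by the temperature and sampling from the resulting softmax distribution; `softmax_with_temperature`, `sample_index` and the toy logits are hypothetical, and the sketch does not reproduce the internals of `tf.random.categorical`.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                          # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(logits, temperature=1.0, rng=random):
    """Draw one index at random according to the tempered softmax distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

toy_logits = [2.0, 1.0, 0.0]
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(toy_logits, t)])

print(sample_index(toy_logits, temperature=1.0, rng=random.Random(0)))
```

A temperature below 1 sharpens the distribution, so the most likely character is picked more often and the text becomes more conservative. A temperature above 1 flattens it, giving more varied but noisier text.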

In [16]:
import logging
import os

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'  # FATAL
logging.getLogger('tensorflow').setLevel(logging.FATAL)

num_epochs = 50
# train in rounds of 10 epochs; after each round, save a checkpoint and print a sample
for i in range(num_epochs // 10):
    model.fit(dataset.repeat(), epochs=10, steps_per_epoch=steps_per_epoch)
    # note: the file index i+1 counts rounds of 10 epochs, not single epochs
    checkpoint_file = os.path.join(CHECKPOINT_DIR, "model_epoch_{:d}".format(i+1))
    model.save_weights(checkpoint_file)
    # create a generative model with batch size 1 from the weights trained so far
    gen_model = CharGenModel(vocab_size, embedding_dim)
    gen_model.load_weights(checkpoint_file)
    gen_model.build(input_shape=(1, None))
    print(generate_text(gen_model, "Alice ", char2idx, idx2char))
    print("---")

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Alice Adies turned with no agoors, in this for ell full be igick/dly much, sto looks in round agstly dooch the Firsty, you for his thooking thought of the lastle being the other veris, and undo to etenct but replect of the Pigrantely, ‘Wown sud into camen thinct Gutence,’ ‘All she its the is," said ithe beant the DO*” she thind dreat again ofxing of pleas head--they: ‘You’d tay mp anx be manate, Twith pactive. I. He said at herontwards of it mane gettered it way dearts!’ ‘Then she Cat again so the was that, theme time sour the subleered got elon. "Yellave the could!’ ‘she had pootembthe upon’t for_ of watchistly!" ‘I his ligute to rush says betien at the which just dwourdany _I't_ poout the Gutenberg-ty justed n. ‘Revera'k! ‘And the crill. I'm YOU, and she causly. "I dear was my,’ she time anly thempity exco into to smake, Let* ‘Cere won by the said to for ase Se, "You’rt stid

---
### <span style="color:#0b486b"> <div  style="text-align:center">**THE END**</div> </span>