Recurrent Neural Networks (RNN): Simple RNN
=========

In this tutorial we will learn how to use Recurrent Neural Networks (RNNs) for text processing.
The example trains a simple RNN to predict the next character in a small text corpus (the novel Alice in Wonderland) and then uses it to generate text one character at a time.


1 - Simple RNNs to Generate text
=========

This example uses a simple RNN with a single hidden layer of recurrent units.

Let's first import the usual Keras and utility functions, including the SimpleRNN recurrent layer type, which is used here for the first time.


In [2]:
from __future__ import print_function
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.layers import SimpleRNN
from tensorflow.keras.models import Sequential
import numpy as np


**Loading text file and converting to clean text**

This code reads the file "alice_in_wonderland.txt", which is available for download at http://www.gutenberg.org/files/11/11-0.txt (and can also be downloaded from Moodle).

The code does some preliminary cleanup of the text (lowercasing, dropping non-ASCII characters, and removing line breaks and empty lines) and stores the result as a single string in the variable called "whole_text".


In [4]:
print("Loading text file...")

INPUT_FILE = "alice_in_wonderland.txt"

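lines = []
with open(INPUT_FILE, 'rb') as fin:  # read as bytes so non-ASCII characters can be dropped when decoding
    for line in fin:
        line = line.strip().lower()  # trim surrounding whitespace and line breaks, then lowercase
        line = line.decode("ascii", "ignore")  # drop any byte that is not ASCII
        if len(line) == 0:  # skip empty lines
            continue
        lines.append(line)
whole_text = " ".join(lines)  # one long string, with lines joined by single spaces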

print("Done!")


Loading text file...
Done!
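
To see what the cleanup does without needing the file on disk, the sketch below applies the same steps to a few in-memory byte strings. The helper name `clean_lines` and the sample lines are illustrative assumptions, not part of the original notebook.

```python
def clean_lines(raw_lines):
    """Apply the notebook's cleanup steps to an iterable of byte strings."""
    cleaned = []
    for raw in raw_lines:
        line = raw.strip().lower()             # trim whitespace and line breaks, lowercase
        line = line.decode("ascii", "ignore")  # silently drop non-ASCII bytes
        if line:                               # skip lines that end up empty
            cleaned.append(line)
    return " ".join(cleaned)


# Hypothetical sample input (not taken from the book file); the third line
# contains curly quotes, which are non-ASCII and therefore removed.
sample = [
    b"Alice was beginning\r\n",
    b"\r\n",
    "to get \u201cvery\u201d tired\n".encode("utf-8"),
]
sample_text = clean_lines(sample)
print(sample_text)
```

Note that dropping non-ASCII bytes removes typographic quotes and apostrophes entirely, which is why words such as "dont" appear without an apostrophe in the generated text further down.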


**Characters look-up table**

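This code creates the look-up tables that map each unique character in the text to an integer ID, and vice versa. The number of unique characters, __nb_chars__, depends on the text file (it is 55 in the run whose model summary is shown later in this tutorial).

Example of the dictionaries (the actual IDs depend on the order in which the characters are enumerated)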

char2index:
{
    'a': 0,
    'b': 1,
    'c': 2,
    ...
}

index2char:
{
    0: 'a',
    1: 'b',
    2: 'c',
    ...
}

In [5]:
print("Preparing characters look-up table...")

chars = sorted(set(whole_text))  # unique characters in whole_text, sorted so that the IDs are the same on every run
nb_chars = len(chars)  # number of unique characters
char2index = {c: i for i, c in enumerate(chars)}  # maps each unique character to its integer ID
index2char = {i: c for i, c in enumerate(chars)}  # maps each integer ID back to its character

print("Done!")

Preparing characters look-up table...
Done!
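
A small round-trip check on a toy string shows how the two dictionaries are used together. Sorting the characters before enumerating them gives the same IDs on every run, whereas the iteration order of a plain set is not guaranteed. The names `toy_text`, `encode` and `decode` are hypothetical helpers for this sketch only.

```python
toy_text = "hello world"  # hypothetical stand-in for whole_text

toy_chars = sorted(set(toy_text))  # [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']
toy_char2index = {c: i for i, c in enumerate(toy_chars)}
toy_index2char = {i: c for i, c in enumerate(toy_chars)}


def encode(text):
    """Map a string to a list of integer IDs."""
    return [toy_char2index[c] for c in text]


def decode(ids):
    """Map a list of integer IDs back to a string."""
    return "".join(toy_index2char[i] for i in ids)


ids = encode("hello")
print(ids)          # [3, 2, 4, 4, 5]
print(decode(ids))  # hello
```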


**Input text and output character**

To create the input text, the code steps through the whole text in increments of __STEP__ characters (1 in our case) and, at each position, extracts a sequence of __SEQLEN__ characters (10 in our case). The character immediately after the extracted sequence is the output label, i.e. the next character to predict.

Here is an example of the input/output pairs for the part of the text starting with "it turned into a pig":

   INPUT (10)    ->   OUTPUT (1)
- "it turned "   ->   "i"
- "t turned i"   ->   "n"
- " turned in"   ->   "t"
- "turned int"   ->   "o"
- "urned into"   ->   " " (space)
- "rned into "   ->   "a"
- "ned into a"   ->   " " (space)
- "ed into a "   ->   "p"
- "d into a p"   ->   "i"
- " into a pi"   ->   "g"


In [9]:
# Prepare the training data for a character-based sequence prediction task.

print("Creating input and label text...")

SEQLEN = 10  # length of each input sequence: the model sees 10 characters (spaces included) per training sample
STEP = 1  # shift between the starts of consecutive sequences; with STEP = 1, neighbouring sequences overlap by SEQLEN - 1 = 9 characters

input_chars = []
label_chars = []
for i in range(0, len(whole_text) - SEQLEN, STEP): # generating the sequences in the entire text (string)
    input_chars.append(whole_text[i:i + SEQLEN]) # the input to the model, a sequence of characters of length 10 starting from index i
    label_chars.append(whole_text[i + SEQLEN]) # the character immediately following the input sequence - the output label that the model should predict.

print("Done!")

Creating input and label text...
Done!
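
The sliding-window step can be checked on the short phrase from the table above. The sketch below wraps the same loop in a hypothetical helper, `make_windows`, and prints each input/label pair.

```python
def make_windows(text, seqlen=10, step=1):
    """Return (inputs, labels): windows of seqlen characters and the character after each."""
    inputs, labels = [], []
    for i in range(0, len(text) - seqlen, step):
        inputs.append(text[i:i + seqlen])  # the window of seqlen characters
        labels.append(text[i + seqlen])    # the character that follows it
    return inputs, labels


inputs, labels = make_windows("it turned into a pig")
for seq, label in zip(inputs, labels):
    print(repr(seq), "->", repr(label))
```

A text of N characters yields N - SEQLEN pairs when STEP is 1, so this 20-character phrase produces 10 pairs.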


**Vectorisation of the input and output**

This prepares the input and output vectors of the training set.
Each input is a matrix of __SEQLEN__ (10) rows by __nb_chars__ columns: one one-hot row for each character in the input text segment.
Each output label is a single one-hot vector of length __nb_chars__, with the unit for the target character set to 1.


Example of the one-hot encoding of the input sequence "it turned " (the positions of the 1s are illustrative and depend on char2index)

Input Sequence: "it turned "

X:
[
    [
        [0, 1, 0, ..., 0],  # one-hot encoding for 'i'
        [0, 0, 1, ..., 0],  # one-hot encoding for 't'
        [0, 0, 0, ..., 0],  # one-hot encoding for ' '
        [0, 0, 1, ..., 0],  # one-hot encoding for 't'
        [0, 0, 0, ..., 0],  # one-hot encoding for 'u'
        [0, 0, 0, ..., 0],  # one-hot encoding for 'r'
        [0, 0, 0, ..., 0],  # one-hot encoding for 'n'
        [0, 0, 0, ..., 0],  # one-hot encoding for 'e'
        [0, 0, 0, ..., 0],  # one-hot encoding for 'd'
        [0, 0, 0, ..., 0],  # one-hot encoding for ' '
    ]
]

y:
[
      [0, 1, 0, ..., 0],  # one-hot encoding for 'i'
]

nb_chars is the number of unique characters in the text

In [11]:
# Prepare the input data (X) and the corresponding labels (y) for training the RNN

print("Vectorizing input and label text...")

X = np.zeros((len(input_chars), SEQLEN, nb_chars), dtype=bool) # initialise an array X to store the input data
y = np.zeros((len(input_chars), nb_chars), dtype=bool)  # initialise an array y to store the corresponding labels
for i, input_char in enumerate(input_chars):
    for j, ch in enumerate(input_char): # iterate over each input sequence and its corresponding characters
        X[i, j, char2index[ch]] = 1  # set the unit for character ch at timestep j of sequence i
    y[i, char2index[label_chars[i]]] = 1  # represents the one-hot encoding of the label (desired predicted character)

print("Done!")

Vectorizing input and label text...
Done!
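
The shapes involved are easier to see on a tiny example. The sketch below vectorises two hypothetical 3-character windows over a 3-character vocabulary, then decodes them back with `argmax` to confirm that nothing was lost. All names (`toy_inputs`, `toy_labels`, `vocab`, `c2i`) are made up for this illustration.

```python
import numpy as np

toy_inputs = ["ab ", "b a"]  # hypothetical input windows (SEQLEN = 3)
toy_labels = ["a", "b"]      # hypothetical next characters
vocab = sorted(set("".join(toy_inputs + toy_labels)))  # [' ', 'a', 'b']
c2i = {c: i for i, c in enumerate(vocab)}

X_toy = np.zeros((len(toy_inputs), 3, len(vocab)), dtype=bool)
y_toy = np.zeros((len(toy_inputs), len(vocab)), dtype=bool)
for i, seq in enumerate(toy_inputs):
    for j, ch in enumerate(seq):
        X_toy[i, j, c2i[ch]] = True        # one active unit per timestep
    y_toy[i, c2i[toy_labels[i]]] = True    # one active unit per label

print(X_toy.shape, y_toy.shape)  # (samples, SEQLEN, nb_chars) and (samples, nb_chars)

# Decode each window back to text by taking the active unit at every timestep
decoded = ["".join(vocab[k] for k in X_toy[i].argmax(axis=1))
           for i in range(len(toy_inputs))]
print(decoded)
```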


**Simple RNN Model definition**

This code defines the key hyperparameters of the model and training, and creates the sequential model. The model consists of a simple recurrent neural network (RNN) with a hidden layer of 128 simple recurrent units, followed by a dense softmax output layer with one unit per character.

The __return_sequences__ argument is set to __False__ because the output is a single character, not a sequence.
The __unroll=True__ setting unrolls the recurrent loop, which can speed up an RNN on short sequences such as these at the cost of more memory.
The model is compiled with the __rmsprop__ optimiser and the categorical cross-entropy loss.
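
As a rough picture of what such a network computes, here is a minimal NumPy sketch of the forward pass. It assumes the standard simple-RNN update h_t = tanh(x_t W + h_{t-1} U + b) (tanh is the default SimpleRNN activation in Keras), followed by a dense layer and a softmax. The weights are random, the sizes are deliberately small, and all names are hypothetical; it is an illustration, not the Keras implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
seqlen, n_chars, hidden = 10, 5, 4  # small hypothetical sizes

W = rng.normal(scale=0.1, size=(n_chars, hidden))  # input -> hidden weights
U = rng.normal(scale=0.1, size=(hidden, hidden))   # hidden -> hidden (recurrent) weights
b = np.zeros(hidden)                               # hidden bias
V = rng.normal(scale=0.1, size=(hidden, n_chars))  # hidden -> output (dense) weights
c = np.zeros(n_chars)                              # output bias


def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()


def forward(x_seq):
    """Run the recurrence over all timesteps and classify from the last hidden state."""
    h = np.zeros(hidden)
    for x_t in x_seq:
        h = np.tanh(x_t @ W + h @ U + b)
    # Only the final h is used, which is what return_sequences=False means
    return softmax(h @ V + c)


# A random one-hot input sequence of shape (seqlen, n_chars)
x_seq = np.eye(n_chars)[rng.integers(0, n_chars, size=seqlen)]
probs = forward(x_seq)
print(probs.shape, probs.sum())
```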



In [12]:
# Definition of the network and training hyperparameters
HIDDEN_SIZE = 128
BATCH_SIZE = 128  # number of samples that will be propagated through the network together
NUM_ITERATIONS = 55  # number of train-then-generate rounds
NUM_EPOCHS_PER_ITERATION = 1  # each round trains on the entire dataset once, so one iteration equals one epoch
NUM_PREDS_PER_EPOCH = 100  # number of characters generated in the test after each round

# Definition of the network topology, with simpleRNN hidden layer
model = Sequential()
model.add(SimpleRNN(HIDDEN_SIZE, return_sequences=False, input_shape=(SEQLEN, nb_chars), unroll=True))  # only the output of the last timestep will be returned
model.add(Dense(nb_chars)) #fully connected layer: maps the output of the SimpleRNN layer to the output space, where each unit corresponds to a unique character.
model.add(Activation("softmax"))

model.compile(loss="categorical_crossentropy", optimizer="rmsprop")

#show the model summary
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 simple_rnn_1 (SimpleRNN)    (None, 128)               23552     
                                                                 
 dense_1 (Dense)             (None, 55)                7095      
                                                                 
 activation_1 (Activation)   (None, 55)                0         
                                                                 
Total params: 30647 (119.71 KB)
Trainable params: 30647 (119.71 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
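
The parameter counts in the summary can be reproduced by hand, which is a useful check on the layer sizes. The sketch below assumes the usual formulas for a simple recurrent layer (input weights, recurrent weights and biases) and for a dense layer (weights and biases); the helper names are hypothetical. The value 55 is read off the Dense output shape in the summary above.

```python
def simple_rnn_params(input_dim, units):
    """Input kernel + recurrent kernel + bias."""
    return input_dim * units + units * units + units


def dense_params(input_dim, units):
    """Weight matrix + bias."""
    return input_dim * units + units


nb_chars_in_summary = 55  # Dense output size shown in the summary
hidden_size = 128

rnn = simple_rnn_params(nb_chars_in_summary, hidden_size)
dense = dense_params(hidden_size, nb_chars_in_summary)
print(rnn, dense, rnn + dense)  # 23552 7095 30647
```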


**Training and testing of the model**

The model is trained for __NUM_ITERATIONS__ epochs (55 in the code above) and tested after each epoch, so that we can monitor how its character predictions improve.

The test picks a random input sequence from the training data as a seed and asks the model for the next character. It then drops the first character of the input, appends the predicted character, and uses the result as the new input to generate another character. This is repeated for __NUM_PREDS_PER_EPOCH__ (100) steps. The completed string gives us an indication of how well the model has learned English words (within the limited lexicon of the Alice novel).


In [13]:
for iteration in range(NUM_ITERATIONS):
    print("=" * 50)
    print("Iteration #: %d" % (iteration))
    model.fit(X, y, batch_size=BATCH_SIZE, epochs=NUM_EPOCHS_PER_ITERATION)

    # testing model
    # randomly choose a row from input_chars, then use it to
    # generate text from model for next 100 chars
    test_idx = np.random.randint(len(input_chars))
    test_chars = input_chars[test_idx]
    print("Generating from seed: %s" % (test_chars))
    print(test_chars, end="")
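    for _ in range(NUM_PREDS_PER_EPOCH):
        Xtest = np.zeros((1, SEQLEN, nb_chars))  # one-hot encode the current window of SEQLEN characters
        for j, ch in enumerate(test_chars):  # j avoids reusing the outer loop variable
            Xtest[0, j, char2index[ch]] = 1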
        pred = model.predict(Xtest, verbose=0)[0]
        ypred = index2char[np.argmax(pred)]
        print(ypred, end="")
        # move forward with test_chars + ypred
        test_chars = test_chars[1:] + ypred
    print()


Iteration #: 0
Generating from seed: ime withou
ime withou the the the the the the the the the the the the the the the the the the the the the the the the the
Iteration #: 1
Generating from seed: ething was
ething was she warke was she warke was she warke was she warke was she warke was she warke was she warke was s
Iteration #: 2
Generating from seed:  rather no
 rather nowe the was a she had and the mad the dithe said the dore the doon the was a she had and the mad the 
Iteration #: 3
Generating from seed: e. it must
e. it must in a dore the made the made the made the made the made the made the made the made the made the made
Iteration #: 4
Generating from seed: , said the
, said the gryphon a dont the moute be the grown the mack to the grean seat a to the grean seat a to the grean
Iteration #: 5
Generating from seed:  a bit. pe
 a bit. persed the formous to the king and a dit her a don it was the did the did the did the did the did the 
Iteration #: 6
Generating from seed: n her hea

Explanation of the training process:

*Input Encoding*: First, the model takes an input sequence of characters. These characters need to be preprocessed and one-hot encoded.

*Forward Pass*: The encoded input sequence is fed into the model. It goes through the layers of the model, including the SimpleRNN layer and the Dense layer with softmax activation.

*Output Prediction*: After passing through the layers, the model produces an output vector with the predicted probabilities for each character in the output vocabulary. The softmax activation ensures that these probabilities sum up to 1.

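*Generating Next Character*: The character with the highest predicted probability is chosen. The first character of the input sequence is then dropped, the chosen character is appended, and the process is repeated for a fixed number of predictions (100) rather than until an end-of-sequence token is reached.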
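
The steps above can be sketched without a trained model by replacing `model.predict` with a stand-in. In the code below, `toy_predict` is a hypothetical function that simply favours the character following the last one in vocabulary order; the point is only to show the softmax, the greedy choice of the most probable character, and the sliding input window.

```python
import math

VOCAB = sorted(set("abc "))  # hypothetical vocabulary: [' ', 'a', 'b', 'c']


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def toy_predict(window):
    """Stand-in for model.predict: favours the character after the last one in VOCAB order."""
    last = VOCAB.index(window[-1])
    scores = [2.0 if i == (last + 1) % len(VOCAB) else 0.0
              for i in range(len(VOCAB))]
    return softmax(scores)


def generate(seed, n_preds):
    window, out = seed, seed
    for _ in range(n_preds):
        probs = toy_predict(window)
        next_char = VOCAB[probs.index(max(probs))]  # greedy choice (argmax)
        out += next_char
        window = window[1:] + next_char  # drop the first character, append the prediction
    return out


print(repr(generate("abc", 5)))
```

Because the choice is always the single most probable character, the same window always produces the same continuation. This is consistent with the repeating loops visible in the early iterations of the output above (for example "the the the").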




**Conclusion**: This example shows how we can train a Simple RNN to predict the next character, using the sequential relationships between characters in the training text.
You can try the same code on a different, longer text, or try different hyperparameters.

**Copyright (c)** 2024 Code and examples adapted from Gulli & Pal (2017) Deep Learning with Keras. Packt Publishing.