## Lab 8, Part 2: Recurrent Neural Networks (RNN) -- Extra Credit

When it comes to modeling sequential data such as sentences, documents, and videos, a standard approach is to use a recurrent neural network (RNN). At each timestep, an RNN takes an element (such as a word) as input, combines it with past information encoded as a vector (such as everything in the sentence before this timestep), generates a new vector encoding both the current input and the past information, and passes that vector on to the next timestep.
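The recurrence described above can be sketched in a few lines of NumPy. This is a minimal illustration of a plain (non-LSTM) RNN step, not the model trained later in this lab; the function name `rnn_step`, the weight names `W_x`, `W_h`, `b`, and the layer sizes are all assumptions chosen for the example.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One vanilla RNN step: mix the current input with the previous state."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

rng = np.random.default_rng(0)
input_dim, hidden_dim = 5, 3  # illustrative sizes, not taken from the lab
W_x = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

sequence = np.eye(input_dim)[[0, 3, 1, 4]]  # four one-hot encoded inputs
h = np.zeros(hidden_dim)                    # empty "memory" before the first element
for x_t in sequence:
    # The new state depends on the current input and on everything seen so far.
    h = rnn_step(x_t, h, W_x, W_h, b)

print(h.shape)
```

The same weights are reused at every timestep; only the state vector `h` changes as the sequence is consumed.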

For more details about LSTM (a very popular variant of RNN), please refer to http://colah.github.io/posts/2015-08-Understanding-LSTMs/. A very good video explaining RNNs is available at https://www.youtube.com/watch?v=WCUNPb-5EYI.

### Generating text with Long Short-Term Memory Networks

RNNs can be used to generate text. For more information, please read: https://karpathy.github.io/2015/05/21/rnn-effectiveness/.

The following is an example script to generate text from Nietzsche's writings.

Note:

- At least 20 epochs are required before the generated text starts sounding coherent.
- It is recommended to run this script on a GPU, as recurrent networks are quite computationally intensive.
- If you try this script on new data, make sure your corpus has at least ~100k characters. ~1M is better.

In [1]:
#Import necessary libraries 
from __future__ import print_function
import tensorflow as tf
from tensorflow.keras.callbacks import LambdaCallback
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.layers import LSTM
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.utils import get_file
import numpy as np
import random
import sys
import io
tf.compat.v1.enable_eager_execution()

In [2]:
#Get the data - available from amazon
path = get_file('nietzsche.txt', origin='https://s3.amazonaws.com/text-datasets/nietzsche.txt')
with io.open(path, encoding='utf-8') as f:
    text = f.read().lower() # make it all lowercase 
print('corpus length:', len(text))

chars = sorted(list(set(text)))
print('total chars:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

corpus length: 600893
total chars: 57


In [3]:
# Cut the text in semi-redundant sequences of maxlen characters
## Cut the text into a series of windows. 
## Each window is 40 characters
## The window moves 3 steps forward each step

maxlen = 40
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb sequences:', len(sentences))

# Turn these sentences into one-hot encoded vectors
## For each character in a sentence, the entry at that character's index is one; every other entry is zero

print('Vectorization...')
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1
print('Done!')

nb sequences: 200285
Vectorization...
Done!
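To make the windowing and one-hot encoding concrete, here is a small sketch on a toy string. The `toy_` names and the window sizes are assumptions chosen for illustration so that the notebook's own variables are not overwritten.

```python
import numpy as np

toy_text = "hello world"
toy_maxlen, toy_step = 4, 3  # much smaller than the 40 / 3 used above

toy_chars = sorted(set(toy_text))
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}

# Slide a window over the text; the target is the character right after it.
windows, targets = [], []
for i in range(0, len(toy_text) - toy_maxlen, toy_step):
    windows.append(toy_text[i:i + toy_maxlen])
    targets.append(toy_text[i + toy_maxlen])

# One-hot encode: one row per window, one column per character in the vocabulary.
x_toy = np.zeros((len(windows), toy_maxlen, len(toy_chars)), dtype=bool)
y_toy = np.zeros((len(windows), len(toy_chars)), dtype=bool)
for i, window in enumerate(windows):
    for t, char in enumerate(window):
        x_toy[i, t, toy_char_indices[char]] = True
    y_toy[i, toy_char_indices[targets[i]]] = True

print(windows)   # ['hell', 'lo w', 'worl']
print(targets)   # ['o', 'o', 'd']
print(x_toy.shape, y_toy.shape)  # (3, 4, 8) (3, 8)
```

Each position in a window has exactly one `True` entry, so `x_toy.sum()` equals the number of windows times the window length.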


Now we have data to feed a model for text generation. Next, we build an LSTM model to fit the data. Using Keras, this takes only a few lines of code!

In [4]:
# build the model: a single LSTM
print('Build model...')
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars)))
model.add(Activation('softmax'))

optimizer = RMSprop(learning_rate=0.01)  # `lr` is a deprecated alias for `learning_rate`
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
print('Done!')

Build model...
Done!
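As a sanity check on the model size, the number of trainable parameters can be worked out by hand. The sketch below assumes the default Keras LSTM layer (biases enabled, no peephole connections) and the 57-character vocabulary printed above; the helper names are hypothetical. The result can be compared against `model.summary()`.

```python
def lstm_param_count(units, input_dim):
    # Four gates, each with input weights, recurrent weights, and a bias vector.
    return 4 * (units * (input_dim + units) + units)

def dense_param_count(units, input_dim):
    # One weight per input/output pair plus one bias per output.
    return input_dim * units + units

n_chars = 57  # 'total chars' printed by the data-loading cell
lstm_params = lstm_param_count(128, n_chars)
dense_params = dense_param_count(n_chars, 128)

print(lstm_params)                 # 95232
print(dense_params)                # 7353
print(lstm_params + dense_params)  # 102585
```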


In [6]:
def sample(preds, temperature=1.0):
    # Helper function to sample an index from a probability array.
    preds = np.asarray(preds).astype('float64')
    # Masked log: zero probabilities are masked instead of producing -inf,
    # then filled with 0 before the temperature scaling.
    preds = np.ma.log(preds)
    preds = preds.filled(0)
    preds = preds / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)


def on_epoch_end(epoch, logs):
    # Function invoked at end of each epoch. Prints generated text.
    print()
    print('----- Generating text after Epoch: %d' % epoch)

    start_index = random.randint(0, len(text) - maxlen - 1)
    for diversity in [0.2, 0.5, 1.0, 1.2]:
        print('----- diversity:', diversity)

        generated = ''
        sentence = text[start_index: start_index + maxlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(400):
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x_pred[0, t, char_indices[char]] = 1.

            preds = model.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            generated += next_char
            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()
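The `diversity` values passed to `sample` act as a temperature on the predicted distribution. The sketch below shows the rescaling step on its own, assuming strictly positive probabilities (so it skips the masked-log handling of zeros) and subtracting the maximum log-probability for numerical stability, which the notebook's helper does not do. `apply_temperature` and `toy_probs` are names introduced for this example.

```python
import numpy as np

def apply_temperature(probs, temperature):
    """Rescale a probability vector: low temperature sharpens it, high flattens it."""
    probs = np.asarray(probs, dtype="float64")
    logits = np.log(probs) / temperature
    exp_logits = np.exp(logits - logits.max())  # shift for numerical stability
    return exp_logits / exp_logits.sum()

toy_probs = [0.7, 0.2, 0.1]
for temperature in [0.2, 1.0, 1.2]:
    print(temperature, np.round(apply_temperature(toy_probs, temperature), 3))
```

At a temperature of 1.0 the distribution is unchanged. Lower values concentrate almost all of the mass on the most likely character, which is why low-diversity samples tend to repeat themselves, while higher values give unlikely characters a better chance.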

### Training (reduce the number of epochs, it takes a lot of time!!)
-  Each epoch takes 5-10 minutes or so on a CPU (an epoch took 7.5 minutes on my PC)
-  Recall that at least 20 epochs of training are needed for intelligible results
-  So you're gonna have to let that puppy run for a while (2-3 hours)

In [6]:
print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

model.fit(x, y,
          batch_size=128,
          epochs=4,
          callbacks=[print_callback])

Train on 200285 samples
Epoch 1/4
----- Generating text after Epoch: 0
----- diversity: 0.2
----- Generating with seed: "e ribs ready for the blind rage with whi"
e ribs ready for the blind rage with which is a man and propers of the hist the more and the spirit and the self the species and the species of the superse of the which the self-experient and provers of the discive and supers and the morality the spirits and propere the some still that it is a reare experience and a propere and the prospres and the prover of the stall and the self-experient and the self of the readous and in the concept
----- diversity: 0.5
----- Generating with seed: "e ribs ready for the blind rage with whi"
e ribs ready for the blind rage with which is a mivele are a mare and intention and the propent in there the mrans and in the discaring and the free the concession of the belief and a mast, and as a reading in this the conscience this the prourment and his all that which it nat, and nature of the distan

<tensorflow.python.keras.callbacks.History at 0x260fea7f988>

In [None]:
import numpy.ma as ma  # masked-array module that provides ma.log

### Load pre-trained model
Since training this LSTM model for more epochs on a CPU is time-consuming, we provide a pre-trained model that was trained on a GPU for 100 epochs. Use the following code to check how coherent its output is.

Loading the model requires the h5py package; please install it before running the following code.

In [7]:
# Load the pre-trained model from disk
print('Load pre-trained model...')
from keras.models import load_model
model = load_model('shakespear200.h5')


def lstm_generate(seed, model):
    for diversity in [0.2, 0.5, 1.0, 1.2]:
        print('----- diversity:', diversity)

        generated = ''
        generated += seed
        print('----- Generating with seed: "' + seed + '"')
        sys.stdout.write(generated)

        for i in range(400):
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(seed):
                x_pred[0, t, char_indices[char]] = 1.

            preds = model.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

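            generated += next_char
            # Slide the window forward. `seed` is not reset between diversity
            # levels, so each level continues from the previous level's output.
            seed = seed[1:] + next_char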

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()


seed = "from an anguish with which no other is t"
# seed = "thou art"
lstm_generate(seed, model)


Load pre-trained model...
----- diversity: 0.2
----- Generating with seed: "from an anguish with which no other is t"
from an anguish with which no other is the subjections of the subjection of the subjection of the sense of the subjection of the such a particularly the subjection of the subjection of the subjection of the sense of the present the passes of the subjection of the sense of the so the subjection of the more to the subjection of all the like the present to the subjection of the states the subjection of the subjection of the subjection of t
----- diversity: 0.5
----- Generating with seed: "of the subjection of the subjection of t"
of the subjection of the subjection of the even about that the subjection of it the community when he was the subjection of the to so far about the seen there is a mustic come the like the confiding and the
children--and such the more the particularly, what is to the probably the forces of the so far of such the present and the sense to the nature

Using TensorFlow backend.


### Exercise: generate baby names
-  The baby name data set contains 8000 names. You can download and process it as follows:

``` python
name_path = get_file('names.txt', origin='http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/corpora/names/other/names.txt')
with io.open(name_path, encoding='utf-8') as f:
    text = f.read()  # keep the original capitalization
    
text = text.split()
text = ', '.join(text)
```

Using the baby name data set, complete the following tasks:

- Train an LSTM to generate baby names.
- How long does it take to train? How coherent does the output sound?
- Can you train the LSTM so that the order of the names is shuffled before each call to model.fit(), once per epoch? How long does it take to train? Does it improve the coherence?

In [2]:
# `from __future__ import print_function` is not needed on Python 3
import tensorflow as tf
from tensorflow.keras.callbacks import LambdaCallback
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.layers import LSTM
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.utils import get_file
import numpy as np
import random
import sys
import io
tf.compat.v1.enable_eager_execution()
name_path = get_file('names.txt', origin='http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/corpora/names/other/names.txt')
with io.open(name_path, encoding='utf-8') as f:
    raw_text = f.read()  # keep the original capitalization

split_text = raw_text.split()
joined_text = ', '.join(split_text)

text = joined_text  # global used by on_epoch_end

In [3]:
def vectorize_data_into_x_and_y(text):
    chars = sorted(list(set(text)))
    print('total chars:', len(chars))
    char_indices = dict((c, i) for i, c in enumerate(chars))
    indices_char = dict((i, c) for i, c in enumerate(chars))
    
    
    maxlen = 25
    step = 2
    sentences = []
    next_chars = []
    for i in range(0, len(text) - maxlen, step):
        sentences.append(text[i: i + maxlen])
        next_chars.append(text[i + maxlen])
    print('nb sequences:', len(sentences))
    
    # Turn these sentences into one-hot encoded vectors
    ## For each character in a sentence, the entry at that character's index is one; every other entry is zero
    
    print('Vectorization...')
    x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
    y = np.zeros((len(sentences), len(chars)), dtype=bool)
    for i, sentence in enumerate(sentences):
        for t, char in enumerate(sentence):
            x[i, t, char_indices[char]] = 1
        y[i, char_indices[next_chars[i]]] = 1
    print('Done!')
    
    return x, y, maxlen, chars, char_indices, indices_char

x, y, maxlen, chars, char_indices, indices_char = vectorize_data_into_x_and_y(joined_text)

total chars: 58
nb sequences: 250882
Vectorization...
Done!


In [4]:
# build the model: a single LSTM
print('Build model...')
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars)))
model.add(Activation('softmax'))

optimizer = RMSprop(learning_rate=0.01)  # `lr` is a deprecated alias for `learning_rate`
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
print('Done!')



Build model...
Done!


In [7]:
print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

In [8]:
model.fit(x, y,
          batch_size=128,
          epochs=8,
          callbacks=[print_callback],
          use_multiprocessing=True,
          workers=16)


Train on 250882 samples
Epoch 1/8
----- Generating text after Epoch: 0
----- diversity: 0.2
----- Generating with seed: ", Debera, Debernardi, Deb"
, Debera, Debernardi, Deber, Deber, Deber, Deber, Deber, Deber, Deber, Deberi, Deber, Deber, Debin, Debing, Debing, Debinge, Debing, Debinge, Debine, Debine, Debing, Debinge, Debing, Debins, Debins, Debinger, Debinger, Debin, Debing, Debinger, Debinger, Debins, Debing, Debinge, Debinge, Debinger, Debins, Debinge, Debine, Debing, Debin, Debing, Debing, Debinger, Debine, Debine, Debins, Debinge, Debin, Debing, Debing, Debin
----- diversity: 0.5
----- Generating with seed: ", Debera, Debernardi, Deb"
, Debera, Debernardi, Debale, Deballi, Debal, Deban, Debale, Deban, Debalin, Debanich, Debarin, Debato, Debas, Decaman, Decaman, Decan, Deca, Deca, Decara, Decard, Decarico, Decaria, Deca, Deca, Deca, Decace, Decas, Deca, Decacz, Decana, Decane, Decano, Decance, Decan, Decane, Decare, Decaria, Decari, Decar, Deca, Decano, Decane, Decard, Decari, D

<tensorflow.python.keras.callbacks.History at 0x29de2f48208>

Training the baby-name generator took 895 seconds for 8 epochs, around 112 seconds per epoch.
I can't tell whether the output became more coherent or realistic over this stretch of training: most of the names in the training data sound so strange to me that I'm not confident judging how far a generated name is from the real ones.

In [None]:
# import keras.utils
# import keras.utils.data_utils
#
# class ShufflePerEpochSequence(keras.utils.data_utils.Sequence):
#     def __init__(self, split_text, batch_size):
#         self.split_text = split_text
#         self.batch_size = batch_size
#         self.text = ', '.join(self.split_text)
#         x, y, maxlen, chars, char_indices, indices_char = vectorize_data_into_x_and_y(self.text)
#         self.x = x
#         self.y = y
#         self.epoch= 0
#
#
#     def __len__(self):
#         return int(np.ceil( len(self.x) / float(self.batch_size) ))
#
#
#     def __getitem__(self, index):
#         batch_x = self.x[index * self.batch_size:(index + 1) * self.batch_size]
#         batch_y = self.y[index * self.batch_size:(index + 1) * self.batch_size]
#
#         return np.array(batch_x), np.array(batch_y)
#
#
#     #todo copy in vectorize_data... function?
#     def __vectorize_data_into_x_and_y(self, text):
#         chars = sorted(list(set(text)))
#         print('total chars:', len(chars))
#         char_indices = dict((c, i) for i, c in enumerate(chars))
#         indices_char = dict((i, c) for i, c in enumerate(chars))
#
#
#         maxlen = 25
#         step = 2
#         sentences = []
#         next_chars = []
#         for i in range(0, len(text) - maxlen, step):
#             sentences.append(text[i: i + maxlen])
#             next_chars.append(text[i + maxlen])
#         print('nb sequences:', len(sentences))
#
#         # Turn these sentences into one-hot encoded vectors
#         ## For each character in a sentence, the entry at that character's index is one; every other entry is zero
#
#         print('Vectorization...')
#         x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
#         y = np.zeros((len(sentences), len(chars)), dtype=bool)
#         for i, sentence in enumerate(sentences):
#             for t, char in enumerate(sentence):
#                 x[i, t, char_indices[char]] = 1
#             y[i, char_indices[next_chars[i]]] = 1
#         print('Done!')
#
#         return x, y, maxlen, chars, char_indices, indices_char
#
#     def __sample(self, preds, temperature=1.0):
#         # helper function to sample an index from a probability array
#         preds = np.asarray(preds).astype('float64')
#         preds = np.log(preds) / temperature
#         exp_preds = np.exp(preds)
#         preds = exp_preds / np.sum(exp_preds)
#         probas = np.random.multinomial(1, preds, 1)
#         return np.argmax(probas)
#
#     def on_epoch_end(self):
#         random.shuffle(self.split_text)  # shuffles in place and returns None
#         self.text = ', '.join(self.split_text)
#         x, y, maxlen, chars, char_indices, indices_char = self.__vectorize_data_into_x_and_y(self.text)
#         self.x = x
#         self.y = y
#
#         print()
#         print('----- Generating text after Epoch: %d' % self.epoch)
#
#         start_index = random.randint(0, len(self.text) - maxlen - 1)
#         for diversity in [0.2, 0.5, 1.0, 1.2]:
#             print('----- diversity:', diversity)
#
#             generated = ''
#             sentence = self.text[start_index: start_index + maxlen]
#             generated += sentence
#             print('----- Generating with seed: "' + sentence + '"')
#             sys.stdout.write(generated)
#
#             for i in range(400):
#                 x_pred = np.zeros((1, maxlen, len(chars)))
#                 for t, char in enumerate(sentence):
#                     x_pred[0, t, char_indices[char]] = 1.
#
#                 preds = model.predict(x_pred, verbose=0)[0]
#                 next_index = self.__sample(preds, diversity)
#                 next_char = indices_char[next_index]
#
#                 generated += next_char
#                 sentence = sentence[1:] + next_char
#
#                 sys.stdout.write(next_char)
#                 sys.stdout.flush()
#             print()
#
#         self.epoch += 1


In [None]:
# shufflingDataSequence = ShufflePerEpochSequence(split_text, 128)
# print(isinstance(shufflingDataSequence, keras.utils.Sequence))
# print(isinstance(shufflingDataSequence, keras.utils.data_utils.Sequence))

In [9]:
# model.run_eagerly = True
# model._experimental_run_tf_function = (False)
# model.fit_generator(generator=shufflingDataSequence, epochs=8) #,
          # use_multiprocessing=True,
          # workers=16)
model.fit(x, y,
          batch_size=128,
          epochs=8,
          callbacks=[print_callback],
          use_multiprocessing=True,
          workers=16,
          shuffle=True)


Train on 250882 samples
Epoch 1/8
----- Generating text after Epoch: 0
----- diversity: 0.2
----- Generating with seed: "ush, Furby, Furches, Fure"
ush, Furby, Furches, Fure, Furen, Furent, Furer, Furger, Furger, Furgers, Furgers, Furgers, Furgers, Furgers, Furgers, Furgers, Furgers, Furgers, Furgers, Furgers, Furgert, Furger, Furger, Furgers, Furgers, Furgers, Furges, Furgers, Furger, Furgers, Furges, Furges, Furgers, Furgers, Furgers, Furgers, Furgers, Furgers, Furgers, Furgers, Furgers, Furgers, Furges, Furges, Furges, Furges, Furger, Furgers, Furgers, Furgers, Fu
----- diversity: 0.5
----- Generating with seed: "ush, Furby, Furches, Fure"
ush, Furby, Furches, Fure, Furek, Furk, Furker, Furkerts, Furking, Furks, Furl, Furland, Furley, Furlon, Furoll, Furold, Furrie, Furritt, Furroy, Furry, Furt, Furte, Furten, Furter, Furter, Furti, Furunde, Furus, Furwell, Fury, Fus, Fuske, Fuskin, Fusler, Fuster, Fuster, Fusterman, Fusterman, Fusters, Fusterman, Fusters, Fuster, Fuster, Fusterman,

<tensorflow.python.keras.callbacks.History at 0x29de4109248>

Adding `shuffle=True` increased the training time to 986 seconds for 8 epochs, around 123 seconds per epoch. Note that `shuffle=True` is already the default for `model.fit`, and it shuffles the training windows rather than the order of the names in the text, so this run does not actually re-order the names between epochs.
As before, I can't tell whether the output became more coherent or realistic: most of the names in the training data sound so strange to me that I'm not confident judging how far a generated name is from the real ones.
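Shuffling the order of the names themselves before each epoch, as the exercise asks, means rebuilding the joined text and its windows every epoch. A minimal sketch of that loop follows, assuming the list of names fits in memory and that training is done one epoch at a time. The names `vectorize`, `train_with_name_shuffling`, and the `fit_one_epoch` callback are hypothetical helpers introduced here, not part of Keras or of the lab code.

```python
import random
import numpy as np

def vectorize(text, char_indices, maxlen=25, step=2):
    """One-hot encode sliding windows of `text` and the character after each."""
    starts = range(0, len(text) - maxlen, step)
    x = np.zeros((len(starts), maxlen, len(char_indices)), dtype=bool)
    y = np.zeros((len(starts), len(char_indices)), dtype=bool)
    for n, i in enumerate(starts):
        for t, char in enumerate(text[i:i + maxlen]):
            x[n, t, char_indices[char]] = True
        y[n, char_indices[text[i + maxlen]]] = True
    return x, y

def train_with_name_shuffling(names, fit_one_epoch, epochs=8, seed=0):
    """Re-order the names before every epoch, then train on the rebuilt text."""
    rng = random.Random(seed)
    names = list(names)  # copy so the caller's list is left untouched
    chars = sorted(set(', '.join(names)))
    char_indices = {c: i for i, c in enumerate(chars)}  # fixed across epochs
    for _ in range(epochs):
        rng.shuffle(names)  # shuffles in place and returns None
        x, y = vectorize(', '.join(names), char_indices)
        fit_one_epoch(x, y)

# Possible usage with the model built above (assumption, not run here):
# train_with_name_shuffling(
#     split_text,
#     lambda x, y: model.fit(x, y, batch_size=128, epochs=1))
```

The character vocabulary is computed once, outside the loop, so the one-hot indices stay consistent with the model's input shape from one epoch to the next.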


