# Text generation using an RNN

Given a sequence of words from this data, train a model to predict the next word in the sequence. Longer sequences of text can be generated by calling the model repeatedly.

**Mount your Google Drive**

In [2]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [0]:
import os
os.chdir('/content/drive/My Drive/data')

### Import Keras and other libraries

In [4]:
import glob

from sklearn.utils import shuffle
import numpy as np

from keras.preprocessing.text import Tokenizer
from keras.models import Sequential, load_model
from keras.layers import LSTM, Dense, Dropout, Embedding, Masking, Bidirectional
from keras.optimizers import Adam
from keras import backend

Using TensorFlow backend.


## Download data
Reference: The data was collected from http://www.gutenberg.org

For this lab, you can load the dataset provided by Great Learning.

### Load the Oscar Wilde dataset

Store all the ".txt" file names in a list

In [5]:
OWlist = glob.glob("./*.txt")
OWlist

['./A Critic in Pall Mall.txt',
 './A House of Pomegranates.txt',
 './A Woman of No Importance a play.txt',
 './An Ideal Husband.txt',
 './Impressions of America.txt',
 './For Love of the King.txt',
 './Children in Prison and Other Cruelties of Prison Life.txt',
 './Essays and Lectures.txt',
 './De Profundis.txt',
 './Charmides and Other Poems.txt',
 './Intentions.txt',
 './Lady Windermere_s Fan.txt',
 './Oscar Wilde Miscellaneous.txt',
 './Reviews.txt',
 './Lord Arthur Savile_s Crime.txt',
 './Miscellaneous Aphorisms_ The Soul of Man.txt',
 './Poems with the Ballad of Reading Gaol.txt',
 './Rose Leaf and Apple Leaf.txt',
 './Salomé A tragedy in one act.txt',
 './Miscellanies.txt',
 './Selected poems of oscar wilde including The Ballad of Reading Gaol.txt',
 './Selected prose of oscar wilde with a Preface by Robert Ross.txt',
 './Shorter Prose Pieces.txt',
 './The Ballad of Reading Gaol.txt',
 './The Happy Prince and other tales.txt',
 './The Canterville Ghost.txt',
 './The Soul of Ma

### Read the data

Read the contents of every file in the list and append the text to a new list.

In [0]:
codetext = []
bookranges = []
for OWfile in OWlist:
    # Record which positions in codetext this book occupies
    start = len(codetext)
    # The context manager closes the file even if reading fails
    with open(OWfile, "r") as OWtext:
        codetext.append(OWtext.read())
    end = len(codetext)
    bookranges.append({"start": start, "end": end, "name": OWfile.rsplit("/", 1)[-1]})

## Process the text
Initialize and fit the tokenizer

In [0]:
tokenizer = Tokenizer(lower=True, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n')
tokenizer.fit_on_texts(codetext)

### Vectorize the text

Before training, we need to map strings to a numerical representation. Create two lookup tables: one mapping words to numbers, and another for numbers to words.

In [0]:
word_idx = tokenizer.word_index
idx_word = tokenizer.index_word
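
As a rough illustration of what these two tables contain, the sketch below builds them by hand for a toy corpus. It assumes, as the Keras `Tokenizer` does, that indices start at 1 and that more frequent words get lower indices. The names `toy_texts`, `toy_word_idx`, and `toy_idx_word` are made up for this example, and the ordering of equally frequent words may differ from the library's.

```python
from collections import Counter

# Toy corpus (hypothetical, not part of the Oscar Wilde data)
toy_texts = ["the happy prince", "the selfish giant", "the happy giant"]

# Count how often each lower-cased word occurs
counts = Counter(word for text in toy_texts for word in text.lower().split())

# The most frequent word gets index 1; index 0 is never assigned to a word
toy_word_idx = {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}
toy_idx_word = {i: word for word, i in toy_word_idx.items()}

# Round trip: words -> numbers -> words
encoded = [toy_word_idx[w] for w in "the happy giant".split()]
decoded = [toy_idx_word[i] for i in encoded]
print(encoded, decoded)
```

Because index 0 is never assigned to a word, the notebook sizes the vocabulary as `len(word_idx) + 1`.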

Get the word count for every word and also get the total number of words.

In [0]:
word_counts = tokenizer.word_counts
num_words = len(word_idx) + 1

Convert the text to sequences of numbers

In [0]:
sequences = tokenizer.texts_to_sequences(codetext)

In [11]:
sequences

[[3244,
  52,
  47,
  365,
  5,
  533,
  6,
  490,
  535,
  26,
  656,
  474,
  2290,
  26,
  157,
  2291,
  11349,
  27,
  365,
  7,
  14,
  1,
  162,
  2,
  624,
  1079,
  25,
  36,
  1058,
  3,
  13,
  317,
  36,
  1732,
  1145,
  11,
  72,
  273,
  10,
  142,
  10,
  154,
  21,
  1080,
  162,
  10,
  184,
  1,
  172,
  2,
  1,
  52,
  47,
  239,
  1027,
  13,
  27,
  365,
  21,
  906,
  25,
  413,
  47,
  355,
  1081,
  5,
  533,
  6,
  490,
  535,
  607,
  656,
  474,
  2061,
  157,
  2291,
  11349,
  2490,
  863,
  2598,
  895,
  15779,
  365,
  11350,
  628,
  213,
  471,
  200,
  3931,
  4824,
  570,
  938,
  2,
  1,
  52,
  47,
  365,
  5,
  533,
  6,
  490,
  535,
  3829,
  28,
  1,
  11351,
  2133,
  698,
  3531,
  483,
  26,
  1856,
  982,
  1127,
  3932,
  916,
  355,
  5,
  533,
  6,
  490,
  535,
  139,
  5214,
  28,
  4475,
  3,
  7600,
  26,
  656,
  474,
  2133,
  698,
  3531,
  4476,
  5215,
  594,
  1043,
  427,
  396,
  127,
  781,
  6,
  11351,
  27,
  1857,
  42,

### Generate Features and Labels

In [0]:
features = []
labels = []

training_length = 50
# Iterate through the sequences of tokens
for seq in sequences:
    # Create 300 training examples from the start of each sequence
    for i in range(training_length, training_length + 300):
        # Take a 20-token window starting at position i - training_length
        extract = seq[i - training_length: i - training_length + 20]

        # The first 19 tokens are the features; the 20th token is the label
        features.append(extract[:-1])
        labels.append(extract[-1])
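
To see what the loop above produces, here is a minimal sketch of the same windowing on a toy sequence. The window length of 4 and the names `toy_seq`, `window`, `toy_features`, and `toy_labels` are assumptions chosen to keep the output small; the notebook itself uses 20-token windows and 300 windows per book.

```python
toy_seq = [11, 12, 13, 14, 15, 16, 17]  # stand-in for one tokenized book
window = 4  # 3 feature tokens + 1 label token (the notebook uses 19 + 1)

toy_features, toy_labels = [], []
for start in range(len(toy_seq) - window + 1):
    extract = toy_seq[start:start + window]
    toy_features.append(extract[:-1])  # all but the last token
    toy_labels.append(extract[-1])     # the token to predict

for feature, label in zip(toy_features, toy_labels):
    print(feature, '->', label)
```

Each step slides the window forward by one token, so consecutive examples overlap heavily.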

### The prediction task

Given a word, or a sequence of words, what is the most probable next word? This is the task we're training the model to perform. The input to the model is a sequence of words, and we train the model to predict the word that follows it.

Because an RNN maintains an internal state that depends on the elements it has already seen, the question becomes: given all the words seen so far, what is the next word?

In [0]:
from sklearn.utils import shuffle
import numpy as np

features, labels = shuffle(features, labels, random_state=1)

# Decide on number of samples for training
train_end = int(0.7 * len(labels))

train_features = np.array(features[:train_end])
valid_features = np.array(features[train_end:])

train_labels = labels[:train_end]
valid_labels = labels[train_end:]

# Convert to arrays
X_train, X_valid = np.array(train_features), np.array(valid_features)

# Using int8 for memory savings
y_train = np.zeros((len(train_labels), num_words), dtype=np.int8)
y_valid = np.zeros((len(valid_labels), num_words), dtype=np.int8)

# One hot encoding of labels
for example_index, word_index in enumerate(train_labels):
    y_train[example_index, word_index] = 1

for example_index, word_index in enumerate(valid_labels):
    y_valid[example_index, word_index] = 1
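
The two encoding loops above can also be written with NumPy integer-array indexing. The sketch below shows the idea on toy labels; `toy_labels` and `toy_num_words` are illustrative values, not taken from the dataset.

```python
import numpy as np

toy_labels = [3, 1, 4]   # hypothetical word indices
toy_num_words = 6        # hypothetical vocabulary size (including index 0)

# One row per example, one column per word
y_toy = np.zeros((len(toy_labels), toy_num_words), dtype=np.int8)

# Set y_toy[row, label] = 1 for every row in a single indexing operation
y_toy[np.arange(len(toy_labels)), toy_labels] = 1
print(y_toy)
```

If the one-hot matrix becomes too large for memory, Keras also provides the `sparse_categorical_crossentropy` loss, which accepts integer labels directly.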

In [56]:
X_train[0:5]

array([[  129,    23,    35,   558,  7116,  1694,    23,     1,  6651,
          664,    45,  6844,    75,   156,  1331,  8941,   293,   915,
           95],
       [   26, 17547, 17548,   696, 12041,     3,     1,   906,   884,
         7407,  5385,    25,   836,   413,  7965,   997,    27,  1345,
           18],
       [20954,     2,   427,   268,   268,     1,  7087,   263,   127,
         1349,    23,     1,  1081,   297,     2,     1,  3425,   483,
           10],
       [ 2299,     2,     1,   271,    60,   407,   164,   601,  9470,
           60,  2524,  8624,    66,  2486,   600,  2717,     1, 17300,
         4408],
       [  937,    14,  1872,   837, 15593,     3,   781,    26,    35,
           25,  5433,  4307,   594, 12998,  9360, 20637,  3712,  7067,
           26]])

This is just to check the features and labels

In [14]:
for i, sequence in enumerate(X_train[:2]):
    text = []
    print(i, sequence)
    # Map each index in the feature sequence back to its word
    for idx in sequence:
        text.append(idx_word[idx])
    # Print once per example, after the whole sequence has been decoded
    print('Features: ' + ' '.join(text) + '\n')
    print('Label: ' + idx_word[np.argmax(y_train[i])] + '\n')

0 [ 129   23   35  558 7116 1694   23    1 6651  664   45 6844   75  156
 1331 8941  293  915   95]
Features: down on him making satirical remarks on the photographs suddenly there leaped out ...

Label: she

## Build The Model

Use `keras.Sequential` to define the model. This simple example stacks five layers:

* `keras.layers.Embedding`: The input layer. A trainable lookup table that maps each word index to a vector with `output_dim=100` dimensions.
* `keras.layers.LSTM`: A type of RNN, here with 64 units. (You can also use a GRU layer here.)
* `keras.layers.Dense`: A fully connected hidden layer with 64 units and ReLU activation.
* `keras.layers.Dropout`: Drops 50% of the hidden activations during training for regularization.
* `keras.layers.Dense`: The output layer, with `num_words` outputs and softmax activation.

In [15]:
import warnings
warnings.filterwarnings('always')  # "error", "ignore", "always", "default", "module" or "once"

model = Sequential()

# Embedding layer
model.add(
    Embedding(
        input_dim=len(word_idx) + 1,
        output_dim=100,
        weights=None,
        trainable=True))

# Recurrent layer
model.add(
    LSTM(
        64, return_sequences=False, dropout=0.1,
        recurrent_dropout=0.1))

# Fully connected layer
model.add(Dense(64, activation='relu'))

# Dropout for regularization
model.add(Dropout(0.5))

# Output layer
model.add(Dense(num_words, activation='softmax'))

# Compile the model
model.compile(
    optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, None, 100)         3283900   
_________________________________________________________________
lstm_1 (LSTM)                (None, 64)                42240     
_________________________________________________________________
dense_1 (Dense)              (None, 64)                4160      
_________________________________________________________________
dropout_1 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 32839)             2134535   
Total params: 5,464,835
Trainable params: 5,464,835
Non-trainable params: 0
_________________________________________________________________
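
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard layouts: an embedding matrix of vocabulary size by embedding dimension, an LSTM with four gates that each have input weights, recurrent weights, and a bias, and dense layers with one bias per output. The variable names are illustrative.

```python
vocab = 32839      # num_words, from the output layer shape in the summary
embed_dim = 100    # output_dim of the Embedding layer
lstm_units = 64
hidden_units = 64

embedding_params = vocab * embed_dim
# 4 gates, each with input weights, recurrent weights, and a bias vector
lstm_params = 4 * (embed_dim * lstm_units + lstm_units * lstm_units + lstm_units)
dense_params = lstm_units * hidden_units + hidden_units
output_params = hidden_units * vocab + vocab

total = embedding_params + lstm_params + dense_params + output_params
print(embedding_params, lstm_params, dense_params, output_params, total)
```

Most of the parameters sit in the embedding and output layers, both of which scale with the vocabulary size.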


For each word in the input sequence, the model looks up the embedding and runs the LSTM for one timestep with that embedding as input. After the last timestep, the dense layers map the final LSTM state to a softmax probability distribution over the next word.
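
The forward pass described above can be sketched in plain NumPy. This is an illustration with small random weights, not the Keras implementation: the sizes, the weight names (`E`, `W`, `U`, `b`, `V`, `c_out`), and the gate ordering are assumptions, and the hidden `Dense` and `Dropout` layers of the notebook's model are left out for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, embed_dim, units = 10, 4, 3   # toy sizes (the notebook uses 32839, 100, 64)

E = rng.normal(size=(vocab, embed_dim))        # embedding table
W = rng.normal(size=(embed_dim, 4 * units))    # input weights for the 4 gates
U = rng.normal(size=(units, 4 * units))        # recurrent weights
b = np.zeros(4 * units)                        # gate biases
V = rng.normal(size=(units, vocab))            # output layer weights
c_out = np.zeros(vocab)                        # output layer bias

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def next_word_probs(word_ids):
    h = np.zeros(units)      # hidden state
    cell = np.zeros(units)   # cell state
    for idx in word_ids:
        x = E[idx]                              # look up the embedding
        z = x @ W + h @ U + b                   # pre-activations of all gates
        i, f, g, o = np.split(z, 4)
        cell = sigmoid(f) * cell + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(cell)          # one LSTM timestep
    logits = h @ V + c_out                      # dense layer on the final state
    exp = np.exp(logits - logits.max())         # numerically stable softmax
    return exp / exp.sum()

probs = next_word_probs([1, 5, 2])
print(probs.shape, probs.sum())
```

Only the hidden state after the last word reaches the output layer, which mirrors `return_sequences=False` in the notebook's LSTM.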

## Train the model

In [14]:
h = model.fit(X_train, y_train, epochs=50, batch_size=50, verbose=1)


Epoch 1/50
...
Epoch 50/50


### Save Model

In [0]:
# save the model to file
model.save('./model_50epochs.h5')

## If you have already trained the model and saved it, you can load a pretrained model

In [19]:
# load the model
model = load_model('./model_50epochs.h5')


### Note: After loading the model, run `model.fit()` to continue training from there, if required.

In [20]:
model.fit(X_train, y_train, batch_size=50, epochs=50)

Epoch 1/50
...
Epoch 50/50


<keras.callbacks.History at 0x7f599bfd81d0>

## Evaluation

In [21]:
print(model.evaluate(X_train, y_train, batch_size = 20))
print('\nModel Performance: Log Loss and Accuracy on validation data')
print(model.evaluate(X_valid, y_valid, batch_size = 20))

[0.07699132082946655, 0.9803379369404642]

Model Performance: Log Loss and Accuracy on validation data
[8.410829422721726, 0.3681003600358963]


## Generate text

In [25]:
import warnings
warnings.filterwarnings('ignore')
seed_length=50
new_words=50
diversity=1
n_gen=1

import random

# Choose a random sequence
seq = random.choice(sequences)

# print seq

# Choose a random starting point
seed_idx = random.randint(0, len(seq) - seed_length - 10)
# Ending index for seed
end_idx = seed_idx + seed_length

gen_list = []

for n in range(n_gen):
    # Extract the seed sequence
    seed = seq[seed_idx:end_idx]
    original_sequence = [idx_word[i] for i in seed]
    generated = seed[:] + ['#']

    # Find the actual entire sequence
    actual = generated[:] + seq[end_idx:end_idx + new_words]
        
    # Keep adding new words
    for i in range(new_words):

        # Make a prediction from the seed
        preds = model.predict(np.array(seed).reshape(1, -1))[0].astype(np.float64)

        # Diversify
        preds = np.log(preds) / diversity
        exp_preds = np.exp(preds)

        # Softmax
        preds = exp_preds / sum(exp_preds)

        # Choose the next word
        probas = np.random.multinomial(1, preds, 1)[0]

        next_idx = np.argmax(probas)

        # New seed adds on old word
        #             seed = seed[1:] + [next_idx]
        seed += [next_idx]
        generated.append(next_idx)
    # Convert the generated indices back to words
    gen_words = []

    for i in generated:
        gen_words.append(idx_word.get(i, '< --- >'))

    gen_list.append(gen_words)

a = []

for i in actual:
    a.append(idx_word.get(i, '< --- >'))

a = a[seed_length:]

gen_list = [gen[seed_length:seed_length + len(a)] for gen in gen_list]

print('Original Sequence: \n'+' '.join(original_sequence))
print("\n")
# print(gen_list)
print('Generated Sequence: \n'+' '.join(gen_list[0][1:]))
# print(a)

Original Sequence: 
these antics were that frolicked with such glee to men whose lives were held in gyves and whose feet might not go free ah wounds of christ they were living things most terrible to see around around they waltzed and wound some wheeled in smirking pairs with the mincing step


Generated Sequence: 
an month in with this welcome to the play which but so but i never indiscreet answers read 21 as a few he two new number wldsp11 txt other j p mr george alexander algernon allen few what one beautiful from the gospel mrs cheveley but the scenes of the
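
The "Diversify" step in the generation cell rescales the predicted distribution before sampling. The sketch below isolates that step. The function name `sample_with_diversity`, the small epsilon added before the logarithm, and the toy probabilities are assumptions made for illustration.

```python
import numpy as np

def sample_with_diversity(preds, diversity=1.0, rng=None):
    """Sample an index from `preds` after rescaling by `diversity`."""
    if rng is None:
        rng = np.random.default_rng()
    preds = np.asarray(preds, dtype=np.float64)
    # Divide log-probabilities by the diversity (epsilon avoids log(0))
    scaled = np.log(preds + 1e-12) / diversity
    # Renormalize with a numerically stable softmax
    scaled = np.exp(scaled - scaled.max())
    probs = scaled / scaled.sum()
    return int(rng.choice(len(probs), p=probs)), probs

toy_preds = [0.7, 0.2, 0.1]
for d in (0.5, 1.0, 2.0):
    _, probs = sample_with_diversity(toy_preds, d, np.random.default_rng(0))
    print(d, np.round(probs, 3))
```

Values of `diversity` below 1 sharpen the distribution toward the most likely word, while values above 1 flatten it and make rarer words more likely.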
