# Text generation using a RNN

Given a sequence of words from this data, train a model to predict the next word in the sequence. Longer sequences of text can be generated by calling the model repeatedly.

**Mount your Google Drive**

In [1]:
from google.colab import drive
drive.mount('/content/gdrive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /content/gdrive


In [2]:
import os
from sklearn.model_selection import train_test_split

### Import Keras and other libraries

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras

## Download data
Reference: Data is collected from http://www.gutenberg.org

For the lab purpose, you can load the dataset provided by Great Learning

### Load the Oscar Wilde dataset

Store all the ".txt" file names in a list

In [4]:
text_files = os.listdir('/content/gdrive/My Drive/NLP/data/')

### Read the data

Read contents of every file from the list and append the text in a new list

In [5]:
all_words = []
for files in text_files[0:2]: #taking a small subset due to session crashing 
    lines = open('/content/gdrive/My Drive/NLP/data/'+files,'r', errors = 'ignore') 
    words = lines.read()
    all_words.append(words)
  

all_words



## Process the text
Initialize and fit the tokenizer

In [6]:
tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\nâ€™S â€˜', lower=True, split=' ')
tokenizer.fit_on_texts(all_words)

### Vectorize the text

Before training, we need to map strings to a numerical representation. Create two lookup tables: one mapping words to numbers, and another for numbers to words.

In [25]:
word_index=tokenizer.word_index
word_to_num = tokenizer.word_index
num_to_word = tokenizer.index_word

Get the word count for every word and also get the total number of words.

In [8]:
word_count = tokenizer.word_counts 
count = 0
for x in word_count:
    count = count + word_count[x]

print("Total Numbner of words present in the vocublary :" , count)
print("Word count for each word : " , word_count)

Total Numbner of words present in the vocublary : 97016


Convert text to sequence of numbers

In [9]:
text_to_sequence_of_numbers = tokenizer.texts_to_sequences(all_words)

In [10]:
len(text_to_sequence_of_numbers)

2

### Generate Features and Labels

In [11]:
input_data = []
output_data = []


for i in text_to_sequence_of_numbers:  
    for j in range(0,len(i) - 10):
        input_seq  = i[j : j + 10]
        output_seq = i[j + 10]
        input_data.append(input_seq)
        output_data.append(output_seq)

  

print('Total number of input arrays: ', len(input_data))
print('Total number of Output arrays: ', len(output_data))
print("Input Data length: ",len(input_data[10]))

Total number of input arrays:  96996
Total number of Output arrays:  96996
Input Data length:  10


### The prediction task

Given a word, or a sequence of words, what is the most probable next word? This is the task we're training the model to perform. The input to the model will be a sequence of words, and we train the model to predict the output—the following word at each time step.

Since RNNs maintain an internal state that depends on the previously seen elements, given all the words computed until this moment, what is the next word?

### Generate training and testing data

In [12]:
y = tf.keras.utils.to_categorical(output_data,num_classes = len(word_index)+1)
X_train,X_test, y_train, y_test = train_test_split(input_data,y, test_size=0.30, random_state=42)

This is just to check the features and labels

In [13]:
print(f"X_train shape:{np.array(X_train).shape}")
print(f"X_train shape:{np.array(X_test).shape}")
print(f"X_train shape:{np.array(y_train).shape}")
print(f"X_train shape:{np.array(y_test).shape}")


X_train shape:(67897, 10)
X_train shape:(29099, 10)
X_train shape:(67897, 10851)
X_train shape:(29099, 10851)


In [14]:
import gc
input_data = None
y = None
gc.collect()

0

## Build The Model

Use `keras.Sequential` to define the model. For this simple example three layers are used to define our model:

* `keras.layers.Embedding`: The input layer. A trainable lookup table that will map the numbers of each character to a vector with `embedding_dim` dimensions;
* `keras.layers.LSTM`: A type of RNN with size `units=rnn_units` (You can also use a GRU layer here.)
* `keras.layers.Dense`: The output layer, with `num_words` outputs.

In [15]:
input_length = 15
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Embedding(input_dim = len(word_index)+1, output_dim = 300, input_length =input_length))

model.add(tf.keras.layers.LSTM(70, activation = 'tanh', recurrent_activation = 'sigmoid'))

model.add(tf.keras.layers.Dense(len(word_index)+1,activation = 'softmax'))

model.compile(optimizer = 'adam', loss='categorical_crossentropy', metrics = ['accuracy']   )
print(model.summary())

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 15, 300)           3255300   
_________________________________________________________________
lstm (LSTM)                  (None, 70)                103880    
_________________________________________________________________
dense (Dense)                (None, 10851)             770421    
Total params: 4,129,601
Trainable params: 4,129,601
Non-trainable params: 0
_________________________________________________________________
None


For each word the model looks up the embedding, runs the LSTM one timestep with the embedding as input, and applies the dense layer to generate logits predicting the log-liklihood of the next word.

## Train the model

In [16]:
tf.keras.backend.clear_session()

In [17]:
model.fit(np.array(X_train), y_train, batch_size= 30, epochs = 10, verbose = 1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f966a2cb5c0>

### Save Model

In [19]:
model.save('/content/gdrive/My Drive/NLP/data_weights'+'my_model.h5')

## If you have already trained the model and saved it, you can load a pretrained model

In [20]:
model1 = tf.keras.models.load_model('/content/gdrive/My Drive/NLP/data_weights'+'my_model.h5')

### Note: After loading the model run  model.fit()  to continue training form there, if required.

In [21]:
model1.fit(np.array(X_train), y_train, batch_size = 60, epochs = 10, verbose = 1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f966a1fe748>

## Evaluation

In [23]:
model.evaluate(np.array(X_test), y_test)



[7.565272331237793, 0.1267053782939911]

## Generate text

In [27]:
X_test = np.array(X_test)
for i, j in enumerate(X_test[150:155],start = 1):
    text = []
    for k in j:
        text.append(num_to_word[k])
    model_pred = model1.predict(np.array(X_test))[i]
    print('Features: ' + ' '.join(text))
    print('Label: ' + num_to_word[np.argmax(y_test[i])])
    print('Predicated word: ' + num_to_word[np.argmax(model_pred)])
    print(" ")

Features: then rise supreme athena argent limbed and if my lips
Label: of
Predicated word: till
 
Features: the division and terror of the world and the choice
Label: confused
Predicated word: an
 
Features: extreme cold of the north dulls the mental faculties of
Label: or
Predicated word: vanilla
 
Features: o’er — these things are well enough —but thou wert
Label: of
Predicated word: of
 
Features: him steatite from sidon in their painted ships the meanest
Label: with
Predicated word: is
 


##### Copyright 2019 The TensorFlow Authors.

Licensed under the Apache License, Version 2.0 (the "License");

In [18]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.



---



---

