# Recurrent neural network model for text generation
### Drive imports & Colab execution

## Libraries & imports
Some tools to work on the dataset and train a deep learning model.
We need tools for:

* system operations
* string management
* plotting

And, of course, for working with deep learning models. For this purpose I will be using TensorFlow.

In [None]:
#Utils
import json
import numpy as np
import re
import string
#plot
import matplotlib.pyplot as plt
#Tensorflow
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.callbacks import LambdaCallback
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, Dropout, Flatten
from tensorflow.keras.layers import LSTM, Bidirectional,GRU, TimeDistributed
from tensorflow.keras.optimizers import RMSprop
#sys
import sys
import io


#drive

from google.colab import drive
drive.mount('/content/drive')


drivepath= "/content/drive/My Drive/TextGen- G Colab/"


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Run this to ensure TensorFlow 2.x is used
try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
    pass

# TEXT MANAGEMENT

## Text import and consolidation

Three texts are imported to build the model:

* Game of Thrones
* The Bible
* The Lord of the Rings

They may be similar in content; after all, all three of them use the word *lord* quite often.

For now, the script will load the texts, process them into a *clean*, punctuation-free text string, and finally tokenize the text into words.


In [None]:

# function to retrieve and consolidate the texts
def build_raw_text(origins, maxlen):
    ans = ""  # answer string

    each_part = int(maxlen / len(origins))  # how many characters to take from each text

    for elem in origins:  # for every text
        try:
            with open(drivepath + "text_origin_{lib}.txt".format(lib=elem), "r") as fp:  # open the text
                lines = fp.read()  # read the whole file as one string
                ans = ans + " \n " + lines[:each_part]  # add the first each_part characters
        except OSError as e:  # if the file cannot be opened or read
            print(e)
    return ans

# function to clean the text and tokenize it into words
def clean_text(doc, filter_words=()):
    tokens = doc.split()  # split the text into words
    table = str.maketrans('', '', string.punctuation)  # translation table that deletes punctuation
    tokens = [w.translate(table) for w in tokens]
    tokens = [word for word in tokens if word.isalpha()]  # only keep purely alphabetic tokens
    tokens = [word.lower() for word in tokens]  # lowercase
    if len(filter_words) != 0:
        tokens = list(filter(lambda k: k not in filter_words, tokens))
    return tokens



#ors = ["bible","got","lotr"]
#ors = ["poemas"]
ors = ["trump"]

maxlim_chars = 5000000 #max limit of characters
raw_text = build_raw_text(ors,maxlim_chars)
token_text = clean_text(raw_text,filter_words=["trump"])

print("Raw text length :",len(raw_text))
print(raw_text[:200])
print("-"*20)
print("Tokens length :",len(token_text))
print(token_text[:5])

Raw text length : 5000007
 
   In the beginning God created the heaven and the earth.

  And the earth was without form, and void; and darkness was upon
the face of the deep. And the Spirit of God moved upon the face of the
wa
--------------------
Tokens length : 888080
['in', 'the', 'beginning', 'god', 'created']


As we can see, the text processing went well. We can compare the number of characters in the imported raw text with the number of tokens the cleaning step produced.
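To make the cleaning steps concrete, here is a minimal, self-contained sketch that applies the same sequence of operations (split, strip punctuation, keep alphabetic tokens, lowercase, filter) to a short sentence. The sample sentence and the filtered word are made up for illustration.

```python
import string

def clean_text(doc, filter_words=()):
    """Split on whitespace, strip punctuation, keep alphabetic tokens, lowercase."""
    table = str.maketrans('', '', string.punctuation)  # deletes every punctuation character
    tokens = [w.translate(table) for w in doc.split()]
    tokens = [w.lower() for w in tokens if w.isalpha()]  # drops numbers and empty strings
    return [w for w in tokens if w not in filter_words]

# Hypothetical sample sentence, not taken from the dataset
sample = "In the beginning, God created 2 worlds: heaven & earth."

cleaned = clean_text(sample)
filtered = clean_text(sample, filter_words=["the"])

print(cleaned)
print(filtered)
```

Note that `"2"` and `"&"` disappear entirely: the first is not alphabetic, and the second becomes an empty string once punctuation is removed. The filter is applied after lowercasing, so filter words should be given in lowercase.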

## Token package creation

In order to train our model we first need to create a set of input-output pairs. The final model will work like this: we input a sequence of $N$ **ordered** words, and we expect as output the **next single word** following this sequence. A suitable training set for this purpose is a collection of sequences of $N+1$ consecutive words: we take out the last word of each sequence and train the model to predict it given the other $N$ words. I call each of these sequences of $N+1$ words a **token package**, and the following part of the code is focused on creating them. Note that in the code below the variable `N` holds the full package length ($50 + 1$), so the model input is 50 words long.

**NOTE:** Here, *token* is just another word for *word*. More generally, we use token to describe a compact unit of text in the context of text processing.
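The sliding-window idea behind token packages can be sketched on a toy token list. The helper name `make_packages` and the toy sentence are assumptions for illustration; unlike the notebook loop, this sketch has no `max_count` cap and also keeps the very last window.

```python
def make_packages(tokens, package_len):
    """Return every run of `package_len` consecutive tokens as (inputs, target)."""
    packages = []
    for end in range(package_len, len(tokens) + 1):
        window = tokens[end - package_len:end]      # package_len consecutive tokens
        packages.append((window[:-1], window[-1]))  # first N words, then the word to predict
    return packages

toy_tokens = "in the beginning god created the heaven and the earth".split()
packages = make_packages(toy_tokens, package_len=4)  # 3 input words + 1 target word

for inputs, target in packages[:3]:
    print(inputs, "->", target)
```

Each window overlaps the previous one by all but one word, so a text of $T$ tokens yields $T - (N+1) + 1$ packages of length $N+1$.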

In [None]:
N = 50 + 1 # length of a token package (50 input words + 1 target word)
max_count = 200000 # approximate cap on the number of token packages
lines = [] # output list with lines

for i in range(N, len(token_text)): # i is the exclusive end index of each package
    seq = token_text[i-N:i] # select the N tokens that precede position i
    line = ' '.join(seq) # create the line by joining the words
    lines.append(line) # append the line to the list of lines
    if i > max_count: 
        break

print("Number of token packages  :",len(lines))
print(lines[0]) #print the first token package

Number of token packages  : 199951
in the beginning god created the heaven and the earth and the earth was without form and void and darkness was upon the face of the deep and the spirit of god moved upon the face of the waters and god said let there be light and there was light and


Now we have compact packages of words that can serve for training a learning model. For a model to read and process the words in the text, we need to translate them into a numeric form that the model can work with. That is why the next step is to create a *dictionary-like* structure that helps us translate each word into a numerical representation (such as a vector or a scalar).

## Tokenizer
The *Tokenizer* class allows us to create such a dictionary, which builds an association between words and numbers. We first *fit* the tokenizer to create a dictionary from a given set of words. We want as many words as possible, since words that are new to the model will be treated as unknown (they do not have an associated number). Once the dictionary is fitted, we use it to *translate* texts into numbers using these correspondences.

The procedure below fits the dictionary on the set of words and translates the same set of words using the created dictionary, so we end up with a set of numbers for each token package instead of a set of words.
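As a rough illustration of what fitting and translating mean, here is a pure-Python approximation of the word-to-index dictionary. It is not the Keras implementation: the helper names are hypothetical, and it only mimics two behaviours of the Keras `Tokenizer` with default settings, namely that more frequent words get smaller indices starting at 1 (0 is reserved for padding) and that words missing from the dictionary are dropped when translating.

```python
from collections import Counter

def fit_word_index(texts):
    """Build a word -> index dictionary; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    """Translate each text into a list of indices, skipping unknown words."""
    return [[word_index[w] for w in text.split() if w in word_index] for text in texts]

toy_lines = ["and the earth was without form", "and the spirit of god moved"]
word_index = fit_word_index(toy_lines)
toy_sequences = texts_to_sequences(["and god moved the unknownword"], word_index)
toy_vocab_size = len(word_index) + 1  # +1 because index 0 is never assigned to a word

print(word_index)
print(toy_sequences)
print(toy_vocab_size)
```

The `+ 1` in the vocabulary size is the same adjustment the notebook makes when it computes `vocab_size` from `tokenizer.word_index`.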

In [None]:
tokenizer = Tokenizer() # creates a tokenizer object
tokenizer.fit_on_texts(lines) # fits on the lines we've created
sequences = tokenizer.texts_to_sequences(lines) # and translates them as well

Now let's take a look into these sequences compared to the lines given before.

In [None]:
print(lines[0]) # first line
print(sequences[0]) # first line translated

vocab_size = len(tokenizer.word_index) + 1 
print("The vocabulary size is ",vocab_size)

in the beginning god created the heaven and the earth and the earth was without form and void and darkness was upon the face of the deep and the spirit of god moved upon the face of the waters and god said let there be light and there was light and
[7, 1, 1212, 32, 1211, 1, 260, 2, 1, 134, 2, 1, 134, 31, 240, 5627, 2, 1937, 2, 946, 31, 37, 1, 214, 3, 1, 1755, 2, 1, 572, 3, 32, 1602, 37, 1, 214, 3, 1, 289, 2, 32, 26, 96, 63, 15, 597, 2, 63, 31, 597, 2]
The vocabulary size is  5628


We can see the correspondence between words and numbers that the Tokenizer created.

## Training set creation: splitting token packages
In order to create a successful training set we need an input (X) and an output (Y), so that we can show examples to the learning model. As mentioned before, the idea of an $N+1$ word package is to take the first $N$ words as the input and the last word as the expected output.

In [None]:
sequences = np.array(sequences) # vectorizing the whole package array  
X  = sequences[:, :-1] # for each sequence, take every element but the last one
Y = sequences[:,-1]  # for each sequence, take the last element
print("For the sequence ", sequences[0])
print("The X input is ", X[0])
print("and the output is ", Y[0] )

seq_length = X.shape[1] # training input sequence length 
print("-"*20)
print("Sequence length: ", seq_length)



For the sequence  [   7    1 1212   32 1211    1  260    2    1  134    2    1  134   31
  240 5627    2 1937    2  946   31   37    1  214    3    1 1755    2
    1  572    3   32 1602   37    1  214    3    1  289    2   32   26
   96   63   15  597    2   63   31  597    2]
The X input is  [   7    1 1212   32 1211    1  260    2    1  134    2    1  134   31
  240 5627    2 1937    2  946   31   37    1  214    3    1 1755    2
    1  572    3   32 1602   37    1  214    3    1  289    2   32   26
   96   63   15  597    2   63   31  597]
and the output is  2
--------------------
Sequence length:  50


## Categorical output: binarizing the Y vector
A neural network model works by activating neurons through operations between the data and trainable weight matrices. This implies that the output is also a set of activations (with a softmax output layer, each activation is a scalar between 0 and 1, and together they sum to 1). To match this kind of output with our numerical output Y, we need to use a **categorical transformation method**. This method transforms each scalar into a vector of length equal to the vocabulary size. If the Y output is (for example) 27, the categorical vector will be a vector of length `vocab_size` in which every position is 0, except for the position with index 27, which is 1.
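The transformation can be sketched in plain Python on a toy vocabulary. The helper `one_hot` and the toy values are assumptions for illustration; `to_categorical` produces the equivalent result as a NumPy array.

```python
def one_hot(index, num_classes):
    """Return a vector of zeros with a single 1.0 at position `index`."""
    vec = [0.0] * num_classes
    vec[index] = 1.0
    return vec

toy_vocab_size = 6   # hypothetical tiny vocabulary
Y_toy = [2, 5, 1]    # hypothetical next-word indices

y_toy = [one_hot(i, toy_vocab_size) for i in Y_toy]
for index, vec in zip(Y_toy, y_toy):
    print(index, "->", vec)

# The original index can be recovered as the position of the largest value
recovered = [vec.index(max(vec)) for vec in y_toy]
```

Recovering the index with an argmax is exactly what happens at generation time, when the model's probability vector is turned back into a word index.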

In [None]:
y = to_categorical(Y, num_classes=vocab_size) # creating categorical vectors Y

# DEEP LEARNING MODEL
Once the dataset is created, we can proceed to create the model to be trained and then train it with the examples we've created.

## Model architecture

A **sequential model** is a valid architecture for this exercise: we create layers of neurons that are connected with the previous and next layers of neurons. In this model we will use the core of the text generation algorithm: LSTM neurons. These **layers of neurons** no longer just transmit information forward, but also keep track of the order of the inputs (that is why the sequences needed to be ordered) by keeping **hidden states** that are passed from one step of the sequence to the next within the layer itself.

In [None]:
def create_model(lstm_neurons,dense_neurons):

    model = Sequential() # sequential model creation

    # add the embedding layer to reduce dimensionality in input vectors
    model.add(Embedding(vocab_size, N-1, input_length=seq_length)) 

    # first LSTM layer; as the next layer is also an LSTM, this one must return sequences
    model.add(LSTM(lstm_neurons, return_sequences=True))

    # second LSTM layer
    model.add(LSTM(lstm_neurons))

    # first dense layer
    model.add(Dense(dense_neurons, activation='relu'))

    #output layer, must be of the categorical Y length i.e vocab_size
    model.add(Dense(vocab_size, activation='softmax'))
    #return model
    return model

model = create_model(64,128) #creating the model
model.summary() #check model characteristics and trainable parameters

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 50, 50)            281400    
_________________________________________________________________
lstm (LSTM)                  (None, 50, 64)            29440     
_________________________________________________________________
lstm_1 (LSTM)                (None, 64)                33024     
_________________________________________________________________
dense (Dense)                (None, 128)               8320      
_________________________________________________________________
dense_1 (Dense)              (None, 5628)              726012    
Total params: 1,078,196
Trainable params: 1,078,196
Non-trainable params: 0
_________________________________________________________________
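The parameter counts in the summary can be reproduced by hand, which is a useful check that the layers are wired as intended. The sketch below uses the standard formulas for embedding, LSTM (four gates, each with input weights, recurrent weights and a bias) and dense layers; the helper function names are hypothetical.

```python
def embedding_params(vocab_size, embed_dim):
    # one embed_dim-sized vector per word index
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias vector
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # weight matrix plus bias vector
    return input_dim * units + units

vocab_size, embed_dim, lstm_units, dense_units = 5628, 50, 64, 128

counts = [
    embedding_params(vocab_size, embed_dim),  # embedding
    lstm_params(embed_dim, lstm_units),       # lstm
    lstm_params(lstm_units, lstm_units),      # lstm_1
    dense_params(lstm_units, dense_units),    # dense
    dense_params(dense_units, vocab_size),    # dense_1
]
total = sum(counts)

print(counts)
print(total)
```

Most of the weights sit in the output layer, because its size grows with the vocabulary.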


In [None]:
'''
for compiling the model we use:
  * categorical cross entropy for checking the loss
  * Adam as the algorithm for "surfing" the error gradient 
  * Accuracy to measure performance
'''
model.compile(loss = 'categorical_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
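To give a feeling for what the loss measures, here is a small hand computation of categorical cross entropy for a single example. With a one-hot target, the loss reduces to minus the logarithm of the probability assigned to the correct word. The probability vectors are made up, and the `eps` guard against `log(0)` is an assumption of this sketch.

```python
import math

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Cross entropy between a one-hot target and a predicted probability vector."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

y_true = [0.0, 0.0, 1.0, 0.0]         # the correct word has index 2
confident = [0.05, 0.05, 0.85, 0.05]  # hypothetical prediction, mostly right
unsure = [0.25, 0.25, 0.25, 0.25]     # hypothetical prediction, uniform guess

loss_confident = categorical_crossentropy(y_true, confident)
loss_unsure = categorical_crossentropy(y_true, unsure)

print(round(loss_confident, 4))
print(round(loss_unsure, 4))
```

A uniform guess over a vocabulary of size $V$ gives a loss of $\log V$, so with thousands of words the loss starts high and should decrease as training progresses.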

## Model fitting

Now, the long wait...
The training stage is just the model reading the given examples over and over in order to calibrate the weight matrices. Each full pass over all the examples is an **epoch**, and we will use 50 of these passes to train the model. The model will read the examples in batches of 256.
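The relationship between examples, batches and epochs is simple arithmetic. The sketch below uses the 199951 token packages reported earlier and assumes that the final, smaller batch of each epoch is kept rather than dropped.

```python
import math

n_examples = 199951  # number of token packages created above
batch_size = 256
epochs = 50

steps_per_epoch = math.ceil(n_examples / batch_size)            # weight updates per epoch
last_batch = n_examples - (steps_per_epoch - 1) * batch_size    # size of the final batch
total_updates = steps_per_epoch * epochs                        # weight updates in the whole run

print(steps_per_epoch, last_batch, total_updates)
```

A larger batch size means fewer but smoother updates per epoch; a smaller one means more updates, each based on fewer examples.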

In [None]:
model.fit(X, y, batch_size = 256, epochs = 50) #fit the model

Epoch 1/50
...
Epoch 50/50


<tensorflow.python.keras.callbacks.History at 0x7fe54ceba630>

# TEXT GENERATION

Once the model is trained, we can proceed with the main goal of this notebook: the continuous generation of text. For this, we need to take into account that the model receives a sequence of $N$ words as a **seed** for generating the next word. Given this, it is important that the seed is coherent with the structure of the training text, since the model learned not just to recognize and predict words, but to use the order of the seed words to give a reasonable output.


`generate_text_seq()` generates `n_words` words after the given `seed_text`. For each new word, we first encode `seed_text` with the same tokenizer used to encode the training data. Then we bring the encoded seed to exactly $N$ words with `pad_sequences()`, which pads short seeds and truncates long ones from the front. Next we run `model.predict()` and take the index with the highest probability (older TensorFlow versions offered `model.predict_classes()` for this, but it was removed in TensorFlow 2.6). After that we look up the word for that index in the tokenizer. Finally we append the predicted word to `seed_text` and to the output, and repeat the process.

In [None]:
'''
This method receives:
 * model: the trained model
 * tokenizer: the fitted dictionary
 * text_seq_length: the N length of each word sequence
 * seed_text: the words that will be used as seed for text generation
 * n_words: the number of words to be generated
'''


def generate_text_seq(model, tokenizer, text_seq_length, seed_text, n_words):

    text = []

    for _ in range(n_words):
        # translate the seed text
        encoded = tokenizer.texts_to_sequences([seed_text])[0]

        # make the sequence of length N (by padding or truncating)
        encoded = pad_sequences([encoded], maxlen=text_seq_length, truncating='pre')

        # generate the probability vector with the trained model and
        # take the index of the most likely word
        y_predict = int(np.argmax(model.predict(encoded, verbose=0), axis=-1)[0])

        # get the predicted word from the tokenizer's index -> word dictionary
        # (index 0 is reserved for padding and has no word)
        predicted_word = tokenizer.index_word.get(y_predict, '')

        # append the new word to the seed_text for the next word
        seed_text = seed_text + ' ' + predicted_word
        # append the new word to the list of words
        text.append(predicted_word)

    # return the generated words joined into one string
    return ' '.join(text)
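The role of `pad_sequences()` in the loop can be illustrated with a pure-Python stand-in. The helper `pad_or_truncate_pre` is hypothetical and only mimics the "pre" behaviour used in the function: long sequences lose their oldest tokens, and short ones are filled with zeros at the front.

```python
def pad_or_truncate_pre(seq, maxlen, pad_value=0):
    """Keep the last `maxlen` items; left-pad with `pad_value` if too short."""
    seq = list(seq)[-maxlen:]                       # truncating='pre': drop the oldest tokens
    return [pad_value] * (maxlen - len(seq)) + seq  # padding='pre': fill at the front

long_seed = [11, 12, 13, 14, 15]
short_seed = [7, 8]

truncated = pad_or_truncate_pre(long_seed, maxlen=3)
padded = pad_or_truncate_pre(short_seed, maxlen=4)

# After a word is generated it is appended, and the window slides forward by one
next_window = pad_or_truncate_pre(long_seed + [99], maxlen=3)

print(truncated, padded, next_window)
```

This is why the seed can keep growing inside the generation loop: only its most recent `text_seq_length` words ever reach the model.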

### Checking results

Now, let's generate some text.

In [None]:
n_lines  = len(lines)
print("There are ",n_lines," lines")
seed_text = lines[190000] # use arbitrary line as seed
seed_text

There are  199951  lines


'founder who made thereof a graven image and a molten image and they were in the house of micah and the man micah had an house of gods and made an ephod and teraphim and consecrated one of his sons who became his priest in those days there was no king'

In [None]:
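#generate text (100 words)
# NOTE: `loaded_model` is created in the "Loading model from disk" cell under "Extra assets";
# run that cell first, or pass `model` to use the model trained above
text_gen = generate_text_seq(loaded_model, tokenizer, seq_length, seed_text, 100)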

print(seed_text + ' ' + text_gen)

founder who made thereof a graven image and a molten image and they were in the house of micah and the man micah had an house of gods and made an ephod and teraphim and consecrated one of his sons who became his priest in those days there was no king of slaying him and slew of the pursuers and the lord said unto moses stretch out of the land of egypt and the lord said unto moses stretch out of the land of egypt and the lord spake unto moses saying speak unto the children of israel and say unto them this day and the lord spake unto moses saying speak unto the children of israel and say unto them this day and the lord spake unto moses saying speak unto the children of israel and say unto them this day and the lord said unto moses take yourselves in


# Extra assets

This section might be useful to test results or check some settings in the notebook.

## Using TPU


In [None]:
# IS A TPU GOING TO BE USED?
## THIS MIGHT NOT BE AVAILABLE FOR LOCAL EXECUTION (TPU AVAILABLE IN GOOGLE COLAB)

call_TPU = True
if(call_TPU):
    %tensorflow_version 2.x
import tensorflow as tf
print("Tensorflow version " + tf.__version__)

try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()  # TPU detection
    print('Running on TPU ', tpu.cluster_spec().as_dict()['worker'])
except ValueError:
    raise BaseException('ERROR: Not connected to a TPU runtime; please see the previous cell in this notebook for instructions!')

tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)
tpu_strategy = tf.distribute.experimental.TPUStrategy(tpu)

## Saving model to disk

In [None]:
# serialize model to JSON
model_json = model.to_json()
with open(drivepath+"model_textgen_v3.json", "w") as json_file:
    json_file.write(model_json)
# serialize weights to HDF5
model.save_weights(drivepath+"model_textgen_v3_weights.h5")
print("Saved model to disk")

Saved model to disk


## Loading model from disk

In [None]:
from tensorflow.keras.models import model_from_json
# load json and create model
with open(drivepath+"model_textgen_v3.json", 'r') as json_file:
    loaded_model_json = json_file.read()
loaded_model = model_from_json(loaded_model_json)
# load weights into new model
loaded_model.load_weights(drivepath+"model_textgen_v3_weights.h5")
print("Loaded model from disk")

Loaded model from disk
