# Text/Lyrics/TV Script generator with LSTM

Let's get our hands dirty with LSTM models. In this project, you will learn how to generate text using Long Short-Term Memory (LSTM) networks. The kind of text the model generates depends on the training data you give the network. If the training text consists of lyrics from your favorite singer, the LSTM model will learn to write songs in that singer's style.

In this project we will use TV scripts from The Simpsons. The code is based on Jason Brownlee's article [How to Develop a Word-Level Neural Language Model and Use it to Generate Text](https://machinelearningmastery.com/how-to-develop-a-word-level-neural-language-model-in-keras/).

Let's walk through the project together, beginning with data preparation, followed by the definition of the LSTM structure, training, and testing.


## Data preparation

In this part of the project, the text file is loaded into memory and cleaned: punctuation and non-alphabetic tokens are removed, and all capital letters are converted to lower case, which makes the learning process easier for the LSTM network. The words will then be tokenized (converted from text to numbers) before training, since the model ultimately works with numbers and not with strings. The tokens are also organized into sequences of a fixed length, because the network has a fixed number of input neurons. A reasonable choice for the sequence length is the mean length of a sentence in the text, which you can obtain from the statistics of your dataset.
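As a rough illustration of how the mean sentence length could be estimated, here is a minimal sketch. The sample text and the `sentence_lengths` helper are hypothetical and not part of the original notebook, and splitting on `.`, `!` and `?` is only a heuristic, not a real sentence tokenizer.

```python
import re
from statistics import mean

# Hypothetical toy text standing in for the real script file.
sample_text = (
    "Moe's Tavern. Where the elite meet to drink. "
    "Hold on, I'll check. Hey, has anybody seen him lately?"
)

def sentence_lengths(text):
    """Return the number of whitespace-separated words in each sentence."""
    # Split on runs of sentence-ending punctuation and drop empty pieces.
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]

lengths = sentence_lengths(sample_text)
print("Sentences: %d" % len(lengths))
print("Mean length: %.2f words" % mean(lengths))
```

Running the same function on the full script would give a starting point for the sequence length, which you can then adjust experimentally.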

In [1]:
# Data preparation
import string

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# turn a doc into clean tokens
def clean_doc(doc):
    # replace '--' with a space ' '
    doc = doc.replace('--', ' ')
    # split into tokens by white space
    tokens = doc.split()
    # remove punctuation from each token
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # make lower case
    tokens = [word.lower() for word in tokens]
    return tokens

# save sequences to file, one sequence per line
def save_doc(lines, filename):
    data = '\n'.join(lines)
    file = open(filename, 'w')
    file.write(data)
    file.close()

# load document
in_filename = 'simpsons.txt'
doc = load_doc(in_filename)
print(doc[:500])
print("\n")

# clean document
tokens = clean_doc(doc)
print(tokens[:100])
print("\n")
print('Total Tokens: %d' % len(tokens))
print('Unique Tokens: %d' % len(set(tokens)))

# organize into sequences of tokens
length = 12 + 1      # +1 refers to the to-be-forecast word
sequences = list()
for i in range(length, len(tokens)):
    # select sequence of tokens
    seq = tokens[i-length:i]
    # convert into a line
    line = ' '.join(seq)
    # store
    sequences.append(line)
print('Total Sequences: %d' % len(sequences))

# save sequences to file
out_filename = 'simpsons_sequences_len12.txt'
save_doc(sequences, out_filename)

[YEAR DATE 1989] © Twentieth Century Fox Film Corporation. All rights reserved.

Moe_Szyslak: (INTO PHONE) Moe's Tavern. Where the elite meet to drink.
Bart_Simpson: Eh, yeah, hello, is Mike there? Last name, Rotch.
Moe_Szyslak: (INTO PHONE) Hold on, I'll check. (TO BARFLIES) Mike Rotch. Mike Rotch. Hey, has anybody seen Mike Rotch, lately?
Moe_Szyslak: (INTO PHONE) Listen you little puke. One of these days I'm gonna catch you, and I'm gonna carve my name on your back with an ice pick.
Moe_Sz


['year', 'date', 'twentieth', 'century', 'fox', 'film', 'corporation', 'all', 'rights', 'reserved', 'moeszyslak', 'into', 'phone', 'moes', 'tavern', 'where', 'the', 'elite', 'meet', 'to', 'drink', 'bartsimpson', 'eh', 'yeah', 'hello', 'is', 'mike', 'there', 'last', 'name', 'rotch', 'moeszyslak', 'into', 'phone', 'hold', 'on', 'ill', 'check', 'to', 'barflies', 'mike', 'rotch', 'mike', 'rotch', 'hey', 'has', 'anybody', 'seen', 'mike', 'rotch', 'lately', 'moeszyslak', 'into', 'phone', 'listen', 
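
The windowing step in the cell above can be illustrated on a toy token list. This is a sketch only: `make_sequences` and `toy_tokens` are hypothetical names, and the window length of 3 + 1 is chosen just to keep the output short. Note that this sketch uses `len(tokens) + 1` as the upper bound of the loop so that the final window is included, whereas the cell above stops one window earlier.

```python
def make_sequences(tokens, length):
    """Return every window of `length` consecutive tokens, one joined line per window."""
    sequences = []
    for i in range(length, len(tokens) + 1):
        # tokens[i-length:i] holds `length - 1` input words plus the word to forecast
        sequences.append(' '.join(tokens[i - length:i]))
    return sequences

toy_tokens = ['moes', 'tavern', 'where', 'the', 'elite', 'meet', 'to', 'drink']
toy_sequences = make_sequences(toy_tokens, 3 + 1)  # 3 input words + 1 target word

for line in toy_sequences:
    print(line)
```

Each printed line overlaps the previous one by all but one word, which is why the number of sequences is close to the number of tokens.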

## Structure definition and modelling of LSTM networks

The following part of the project defines the structure of the LSTM network and runs the training process. The model is built to forecast one word at a time, based on a given sequence. For example, if the sequence is "I come from Spain, my mother tongue is", the LSTM model should forecast "Spanish" (provided, of course, that it could learn the dependency between Spain and Spanish from the training dataset).
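
To make the input/target split concrete, here is a minimal pure-Python sketch based on the example sentence above. The hand-built `word_index` is an assumption made for illustration: Keras' `Tokenizer` also starts its indices at 1 and reserves 0, but it orders words by frequency rather than by first appearance.

```python
# Hypothetical cleaned sequence; the last word is the one to be forecast.
line = 'i come from spain my mother tongue is spanish'
words = line.split()

# Build a word -> integer index, starting at 1 (0 is left unused, as in Keras).
word_index = {}
for w in words:
    word_index.setdefault(w, len(word_index) + 1)

encoded = [word_index[w] for w in words]
vocab_size = len(word_index) + 1

# All words but the last are the input; the last word is the target.
X, y = encoded[:-1], encoded[-1]

# One-hot encode the target, which is what to_categorical does for each row.
one_hot = [1 if i == y else 0 for i in range(vocab_size)]

print('X:', X)
print('y:', y)
print('one-hot y:', one_hot)
```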

To describe the LSTM model, we use Keras through the interface bundled with TensorFlow. If you want to use the GPU, you have to install tensorflow-gpu in the conda environment, but you still import tensorflow as shown below.

When defining the LSTM network, an [Embedding Layer](https://www.tensorflow.org/guide/embedding) is added. It produces vector representations of words, called "word embeddings", as described in the article [Distributed Representations of Words and Phrases and their Compositionality](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) by Mikolov et al. This procedure helps the network learn the relationships between words, instead of treating each term as independent. For a really good explanation of the reasons to use embeddings for word vectorization, we recommend reading the article [Vector Representations of Words](https://www.tensorflow.org/tutorials/representation/word2vec) on the TensorFlow website.
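
Conceptually, an embedding layer is a trainable lookup table with one row per word index. The following pure-Python sketch, with a hypothetical 5-word vocabulary and random values standing in for learned weights, shows that looking up a row gives the same result as multiplying a one-hot vector by the embedding matrix.

```python
import random

random.seed(0)
vocab_size, embedding_dim = 5, 3

# Random small values as a stand-in for weights that training would learn.
embedding_matrix = [
    [random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]

def embed(word_ids):
    """Look up one embedding vector per word index."""
    return [embedding_matrix[i] for i in word_ids]

def one_hot_matmul(word_id):
    """Multiply a one-hot row vector by the embedding matrix (the slow, equivalent way)."""
    one_hot = [1.0 if i == word_id else 0.0 for i in range(vocab_size)]
    return [
        sum(one_hot[i] * embedding_matrix[i][j] for i in range(vocab_size))
        for j in range(embedding_dim)
    ]

vectors = embed([1, 3, 1])
print(len(vectors), 'vectors of size', len(vectors[0]))
print('same word, same vector:', vectors[0] == vectors[2])
```

Because the table is dense and low-dimensional, words that appear in similar contexts can end up with similar vectors during training, which a one-hot encoding cannot express.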

We recommend playing with the hyperparameters of the model (number of layers, neurons, optimizers, batch sizes, epochs, etc.) in order to obtain better results. You can also experiment with a hyperparameter optimization library, instead of changing the values by hand. For this, you can use the library [GPyOpt](https://github.com/SheffieldML/GPyOpt) for Bayesian Optimization or [HyperOpt](http://hyperopt.github.io/hyperopt/), which provides random search and Tree of Parzen Estimators algorithms. To learn more about hyperparameter optimization methods, refer to these articles:
- [Hyperparameter Optimization in Machine Learning Models](https://www.datacamp.com/community/tutorials/parameter-optimization-machine-learning-models) by Sayak Paul.
- [Automated Machine Learning Hyperparameter Tuning in Python](https://towardsdatascience.com/automated-machine-learning-hyperparameter-tuning-in-python-dfda59b72f8a) by Will Koehrsen.
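
To give an idea of what automated tuning does, here is a minimal random-search sketch in plain Python. It does not use GPyOpt or HyperOpt. The search space values are hypothetical, and `evaluate` is a purely synthetic stand-in: in a real run it would build and train the LSTM with the given configuration and return the validation loss.

```python
import random

# Hypothetical search space; the values are illustrative, not recommendations.
search_space = {
    'lstm_units': [128, 256, 512],
    'embedding_dim': [50, 100, 150],
    'batch_size': [32, 64, 128],
    'learning_rate': [1e-2, 1e-3, 1e-4],
}

def sample_config(rng):
    """Draw one value at random for every hyperparameter."""
    return {name: rng.choice(values) for name, values in search_space.items()}

def evaluate(config):
    """Synthetic stand-in objective (lower is better); replace with real training."""
    return (abs(config['lstm_units'] - 256) / 256
            + abs(config['learning_rate'] - 1e-3) * 100)

def random_search(n_trials, seed=0):
    """Try n_trials random configurations and keep the best one."""
    rng = random.Random(seed)
    best_config, best_score = None, float('inf')
    for _ in range(n_trials):
        config = sample_config(rng)
        score = evaluate(config)
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score

best_config, best_score = random_search(n_trials=20)
print(best_config, best_score)
```

Bayesian methods such as those in the libraries above follow the same loop, but choose the next configuration based on the scores seen so far instead of sampling blindly.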

In [2]:
# Import libraries
from numpy import array
from pickle import dump
import tensorflow as tf
# Use the public tensorflow.keras API instead of the private tensorflow.python.keras modules
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Embedding
from tensorflow.keras.optimizers import Adam

In [3]:
# Code for fitting the language model
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load
in_filename = 'simpsons_sequences_len12.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')

# integer encode sequences of words
tokenizer = Tokenizer()
tokenizer.fit_on_texts(lines)
sequences = tokenizer.texts_to_sequences(lines)
# vocabulary size
vocab_size = len(tokenizer.word_index) + 1

# separate into input and output
sequences = array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)
seq_length = X.shape[1]

# define model
model = Sequential()
model.add(Embedding(vocab_size, 150, input_length=seq_length))
# To stack more LSTM layers, set return_sequences=True on every LSTM layer except the last one, for example:
# model.add(LSTM(512, return_sequences=True))
model.add(LSTM(512))
model.add(Dense(512, activation='relu'))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())
# compile model
adam_opt = Adam(clipvalue=1)
model.compile(loss='categorical_crossentropy', optimizer=adam_opt, metrics=['accuracy'])
# fit model
model.fit(X, y, batch_size=64, epochs=150)

# save the model to file
model.save('model_simpsons_150epochs.h5')
# save the tokenizer
with open('tokenizer_simpsons_150epochs.pkl', 'wb') as f:
    dump(tokenizer, f)

Instructions for updating:
Colocations handled automatically by placer.
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 12, 150)           987000    
_________________________________________________________________
lstm (LSTM)                  (None, 512)               1357824   
_________________________________________________________________
dense (Dense)                (None, 512)               262656    
_________________________________________________________________
dense_1 (Dense)              (None, 6580)              3375540   
Total params: 5,983,020
Trainable params: 5,983,020
Non-trainable params: 0
_________________________________________________________________
None
Instructions for updating:
Use tf.cast instead.
Epoch 1/150
 7424/48660 [===>..........................] - ETA: 3:02 - loss: 7.2813 - acc: 0.0251

KeyboardInterrupt: 
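
The parameter counts in the summary above can be reproduced by hand, which is a useful sanity check when you change layer sizes. The sketch below takes the sizes from the summary (vocabulary of 6580, embedding dimension 150, 512 LSTM units, 512 dense units) and uses the standard formulas for each layer type. The variable names are introduced here for illustration.

```python
vocab_size = 6580      # from the output shape of the last Dense layer
embedding_dim = 150
lstm_units = 512
dense_units = 512

# Embedding: one vector of size embedding_dim per vocabulary entry.
embedding_params = vocab_size * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights and a bias.
lstm_params = 4 * ((embedding_dim + lstm_units + 1) * lstm_units)

# Dense layers: weight matrix plus one bias per output unit.
dense_params = lstm_units * dense_units + dense_units
output_params = dense_units * vocab_size + vocab_size

total_params = embedding_params + lstm_params + dense_params + output_params

print(embedding_params, lstm_params, dense_params, output_params)
print('Total params:', total_params)
```

Most of the parameters sit in the output layer and the LSTM, so the vocabulary size and the number of LSTM units are the settings that affect model size the most.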

## Model test

After training, we are ready to test the LSTM network by generating new scripts for The Simpsons. To do this, a seed sequence is fed to the trained model, which predicts the most probable next word. The predicted word is appended to the sequence, and the result is fed back into the model to predict another word. This process is repeated until *n_words* words have been generated.
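
The iterative procedure just described can be sketched without a neural network. In the toy example below, `toy_model` is a hypothetical table of next-word probabilities that depends only on the last word, standing in for the trained LSTM. The loop structure (predict, append, feed back) is the part that carries over.

```python
# Hypothetical stand-in for the trained network: next-word probabilities
# conditioned on the last word of the context only.
toy_model = {
    'moes': {'tavern': 0.9, 'bar': 0.1},
    'tavern': {'where': 0.7, 'is': 0.3},
    'where': {'the': 0.8, 'is': 0.2},
    'the': {'elite': 0.6, 'bar': 0.4},
    'elite': {'meet': 1.0},
}

def predict_next(context):
    """Return the most probable next word, or None if the context is unknown."""
    probs = toy_model.get(context[-1])
    if probs is None:
        return None
    return max(probs, key=probs.get)

def generate(seed_text, n_words, seq_length=3):
    """Greedily generate up to n_words words, feeding each prediction back in."""
    in_text = seed_text.split()
    result = []
    for _ in range(n_words):
        # Keep only the most recent seq_length words, like truncating='pre'.
        context = in_text[-seq_length:]
        next_word = predict_next(context)
        if next_word is None:
            break
        in_text.append(next_word)
        result.append(next_word)
    return ' '.join(result)

print(generate('moes', 4))
```

Because the most probable word is always chosen, the same seed always produces the same text.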

In [None]:
# Code for generating text from the trained language model
from random import randint
from pickle import load
import tensorflow as tf
# Use the public tensorflow.keras API instead of the private tensorflow.python.keras modules
from tensorflow import keras
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# generate a sequence from a language model
def generate_seq(model, tokenizer, seq_length, seed_text, n_words):
    result = list()
    in_text = seed_text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integer
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        # truncate sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
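        # predict the index of the most probable next word
        # (predict_classes is not available in newer TensorFlow releases, so take the argmax of predict)
        yhat = model.predict(encoded, verbose=0).argmax(axis=-1)[0]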
        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # append to input
        in_text += ' ' + out_word
        result.append(out_word)
    return ' '.join(result)

# load cleaned text sequences
in_filename = 'simpsons_sequences_len12.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')
seq_length = len(lines[0].split()) - 1

# load the model
model = load_model('model_simpsons_150epochs.h5')

# load the tokenizer
tokenizer = load(open('tokenizer_simpsons_150epochs.pkl', 'rb'))

# select a seed text
# randint includes both endpoints, so the upper bound must be len(lines) - 1
seed_text = lines[randint(0, len(lines) - 1)]
print(seed_text + '\n')

# generate new text
generated = generate_seq(model, tokenizer, seq_length, seed_text, 50)
print(generated)