# Text Generation using Bidirectional LSTM and Doc2Vec models

The purpose of [this article](https://medium.com/@david.campion/text-generation-using-bidirectional-lstm-and-doc2vec-models-1-3-8979eb65cb3a) is to discuss text generation using machine learning approaches, especially neural networks.

It is not the first article on the topic, and probably not the last. There is a lot of literature about text generation using "AI" techniques, and code is available to generate text from existing novels, trying to create new chapters for **"Game of Thrones"** or **"Harry Potter"**, or a new piece in the style of **Shakespeare**. Sometimes with interesting results.

These approaches mainly use classic LSTM networks, and they are pretty fun to experiment with.

However, the generated texts leave a feeling of incompleteness. Generated sentences seem quite right, with correct grammar and syntax, as if the neural network correctly understood the structure of a sentence. But the text as a whole does not make much sense, when it is not complete nonsense.

This problem could come from the approach itself, using only an LSTM to generate text word by word. But how can we improve on it? In this article, I will try to investigate a new way to generate sentences.

It does not mean that I will use something completely different from LSTM: I will still use an LSTM network to generate sequences of words. However, I will try to go further than a classic LSTM neural network and use an additional neural network (LSTM again) to select the best phrases.

This article can therefore be used as a tutorial. It describes:
 1. **how to train a neural network to generate sentences** (i.e. sequences of words), based on existing novels. I will use a bidirectional LSTM architecture to do that.
 2. **how to train a neural network to select the best next sentence for a given paragraph** (i.e. a sequence of sentences). I will also use a bidirectional LSTM architecture, in addition to a Doc2Vec model of the target novels.


### Note about Data inputs
As data inputs, I will not use texts that are not free of intellectual property rights. So I will not train the solution to create a new chapter for **"Game of Thrones"** or **"Harry Potter"**.
Sorry about that, but there is plenty of "free" text for such text generation exercises, and we can dive into [Project Gutenberg](http://www.gutenberg.org), which provides a huge amount of texts (from [William Shakespeare](http://www.gutenberg.org/ebooks/author/65) to [H.P. Lovecraft](http://www.gutenberg.org/ebooks/author/34724), or other great authors).

However, I am also a French author of fantasy and science fiction. So I will use my personal material to create a new chapter of my stories, hoping it can help me in my next work!

So, I will base this exercise on **"Artistes et Phalanges"**, a French fantasy novel I wrote over the past 10 years, which I hope will be sufficient in terms of data inputs. It contains more than 830,000 characters.

By the way, if you are a French reader and fond of fantasy, you can find it on the iBooks Store and Amazon Kindle for free... Please note that I also provide the data for free on my GitHub repository. Enjoy it!

## 1. A Neural Network for Generating Sentences

The first step is to generate sentences in the style of a given author.

There is a huge literature about it, especially using LSTMs to perform this task. As this kind of network works well for this job, we will use it.

The purpose of this note is not to dive deep into a description of LSTMs; you can find very good articles about them, and I suggest you read [this article](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) from Andrej Karpathy.

You can also easily find existing code to perform text generation using LSTMs. On my GitHub, you can find two tutorials, one using [Tensorflow](https://github.com/campdav/text-rnn-tensorflow), and another one using [Keras](https://github.com/campdav/text-rnn-keras) (on top of Tensorflow), which is easier to understand.

For this first part of the exercise, I will reuse these materials, with a few improvements:
 - Instead of a simple _LSTM_, I will use a _bidirectional LSTM_. This network configuration converges faster than a single LSTM (fewer epochs are required) and, from empirical tests, seems better in terms of accuracy. You can have a look at [this article](https://machinelearningmastery.com/develop-bidirectional-lstm-sequence-classification-python-keras/) from Jason Brownlee for a good tutorial about bidirectional LSTMs.
 - I will use Keras, which requires less complexity to create the network and is more readable than conventional Tensorflow code.

### 1.1. What is the neural network task in our case?

LSTM (Long Short-Term Memory) networks are very good at analysing sequences of values and predicting the next values from them. For example, an LSTM could be a very good choice if you want to predict the next point of a given time series (assuming a correlation exists in the sequence).

Sentences are basically sequences of words. So it is natural to assume an LSTM could be useful to generate the next word of a given sentence.

In summary, the objective of an LSTM neural network in this situation is to guess the next word of a given sentence.

For example:
What is the next word of the following sentence: "he is walking down the"?

Our neural net will take the sequence of words as input: "he", "is", "walking", ...
Its output will be a vector providing, for each word of the dictionary, the probability of being the next word of the given sentence.

Then, how will we build the complete text? Simply by iterating the process: we shift the sentence by one word, appending the newly guessed word at its end. Then we guess a new word for this new sentence. Ad vitam aeternam.
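
To make the iteration concrete, here is a minimal sketch of the "predict, append, shift" loop. The trained network is replaced by a hypothetical stand-in, `toy_predict`, and the vocabulary `toy_vocab` is invented for the illustration; neither is part of the notebook.

```python
import numpy as np

# Toy vocabulary (an assumption for this sketch; the real one is built from the training text)
toy_vocab = ["he", "is", "walking", "down", "the", "street", "."]
word_to_index = {w: i for i, w in enumerate(toy_vocab)}


def toy_predict(window):
    """Hypothetical stand-in for the trained network.

    Returns one probability per vocabulary word, favouring the word that
    follows the last word of the window in toy_vocab order.
    """
    probs = np.full(len(toy_vocab), 0.01)
    next_index = (word_to_index[window[-1]] + 1) % len(toy_vocab)
    probs[next_index] = 1.0
    return probs / probs.sum()


def generate(seed, n_words):
    window = list(seed)      # the words currently fed to the predictor
    generated = list(seed)   # the full text produced so far
    for _ in range(n_words):
        probs = toy_predict(window)
        next_word = toy_vocab[int(np.argmax(probs))]
        generated.append(next_word)
        # Slide the window: drop the oldest word, append the new one
        window = window[1:] + [next_word]
    return generated


result = generate(["he", "is", "walking", "down", "the"], 2)
print(" ".join(result))
```

The window always keeps the same length, which is what a network with a fixed input size needs.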

### 1.1.1. Process

In order to do that, we first need a dictionary containing all the words from the novels we want to use. The whole process is:

 1. read the data (the novels we want to use),
 2. create the dictionary of words,
 3. create the list of sentences,
 4. create the neural network,
 5. train the neural network,
 6. generate new sentences.

In [1]:
from __future__ import print_function
from keras.models import Sequential, Model
from keras.layers import Dense, Activation, Dropout
from keras.layers import LSTM, Input, Flatten, Bidirectional
from keras.layers.normalization import BatchNormalization
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras.metrics import categorical_accuracy
import numpy as np
import random
import sys
import os
import time
import codecs
import collections
from six.moves import cPickle

Using TensorFlow backend.


We have raw text, and a lot of things have to be done before we can use it: splitting it into a list of words, etc.
To do that, I use the spaCy library, which is incredible for dealing with text. For this exercise, I will only use very few options from spaCy.

In [2]:
# import spacy and load the English model
import spacy
nlp = spacy.load('en')

# parameters

In [3]:
data_dir = 'data'# data directory containing input.txt
save_dir = 'save' # directory to store models
seq_length = 20 # sequence length
sequences_step = 1 #step to create sequences

In [4]:
vocab_file = os.path.join(save_dir, "words_vocab.pkl")

# read data

I create a specific function to create a list of words from raw text. I use the spaCy library, with a specific function that retrieves only the lower-case form of each word and removes carriage returns (\n).

I am doing that because I want to reduce the number of potential words in my dictionary, and I assume we do not need to keep capital letters. Indeed, they are only part of the syntax of the text, its shape, and do not affect its meaning.
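
As a small illustration of why lower-casing shrinks the dictionary, the sketch below counts distinct tokens before and after lower-casing. The token list is made up for the example and does not come from the training data.

```python
import collections

# Invented tokens: "The"/"the" and "Train"/"train" differ only by case
tokens = ["The", "train", "crashes", ".", "the", "Train", "stops", "."]

cased = collections.Counter(tokens)
lowered = collections.Counter(t.lower() for t in tokens)

print("distinct tokens, case kept:    ", len(cased))
print("distinct tokens, lower-cased:  ", len(lowered))
```

Every distinct token becomes one column of the one-hot encoding used later, so merging case variants directly reduces the size of the network input and output.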

In [5]:
def create_wordlist(doc):
    wl = []
    for word in doc:
        wl.append(word.text.lower())
    return wl

In [11]:
def create_wordlist(doc):
    # whitespace tokens to drop from the word list
    skipped_tokens = ('\n\n\n',)
    wl = []
    for word in doc:
        if word.text not in skipped_tokens:
            wl.append(word.text.lower())
    return wl

Create the list of sentences:

In [44]:
input_file = os.path.join(data_dir, "Combinations_of_Several_Movies.txt")
with open(input_file, encoding="utf8") as f:
    data = f.read()
    doc = nlp(data)
    print(doc)

INT.  SUBWAY





With the clash of cymbals, the train crashes into the wall of


rubble.





EXT.  NEW GOVERNMENT BUILDING





The entire building opens like a time-lapsed rose blooming


with brilliant orange petals of flame.





EXT.  CITY STREET





The crowd is awash in the baptismal glow of erupting flame.





EXT.  ROOFTOP





Evey watches the explosion, a star-burst of flaming debris


searing against the night sky like fireworks.





EXT.  CITY STREET





The masses burst through the barricades with a euphoric


frenzy.





EXT.  ROOFTOP





The explosion begins to slowly die.



EXT.  CITY STREET





An enormous crowd has begun to gather in the streets


surrounding the New Government Building.  With the crowd, a


restlessness swells against each barricade erected by the


military.





A sergeant stands on an armored car, speaking through a


megaphone.



The Leader screams.





V drives the knife into his heart, killing him instantly.





V stands alone amid




In [12]:
wordlist = []
for file_name in ['Combinations_of_Several_Movies']:
    input_file = os.path.join(data_dir, file_name + ".txt")
    #read data
    with open(input_file, encoding="utf8") as f:
        data = f.read()
    #create sentences
    doc = nlp(data)
    wl = create_wordlist(doc)
    wordlist = wordlist + wl

In [13]:
wordlist

['int',
 '.',
 ' ',
 'subway',
 '\n\n\n\n\n\n',
 'with',
 'the',
 'clash',
 'of',
 'cymbals',
 ',',
 'the',
 'train',
 'crashes',
 'into',
 'the',
 'wall',
 'of',
 'rubble',
 '.',
 '\n\n\n\n\n\n',
 'ext',
 '.',
 ' ',
 'new',
 'government',
 'building',
 '\n\n\n\n\n\n',
 'the',
 'entire',
 'building',
 'opens',
 'like',
 'a',
 'time',
 '-',
 'lapsed',
 'rose',
 'blooming',
 'with',
 'brilliant',
 'orange',
 'petals',
 'of',
 'flame',
 '.',
 '\n\n\n\n\n\n',
 'ext',
 '.',
 ' ',
 'city',
 'street',
 '\n\n\n\n\n\n',
 'the',
 'crowd',
 'is',
 'awash',
 'in',
 'the',
 'baptismal',
 'glow',
 'of',
 'erupting',
 'flame',
 '.',
 '\n\n\n\n\n\n',
 'ext',
 '.',
 ' ',
 'rooftop',
 '\n\n\n\n\n\n',
 'evey',
 'watches',
 'the',
 'explosion',
 ',',
 'a',
 'star',
 '-',
 'burst',
 'of',
 'flaming',
 'debris',
 'searing',
 'against',
 'the',
 'night',
 'sky',
 'like',
 'fireworks',
 '.',
 '\n\n\n\n\n\n',
 'ext',
 '.',
 ' ',
 'city',
 'street',
 '\n\n\n\n\n\n',
 'the',
 'masses',
 'burst',
 'through',
 'th

## Create dictionary

The first step is to create the dictionary, that is, the list of all words contained in the texts. We assign an index to each word.
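
Here is a toy version of that mapping, under the assumption that indices are assigned in sorted order of the distinct words. The names `toy_wordlist`, `index_to_word` and `word_to_index` are illustrative only.

```python
import collections

toy_wordlist = ["the", "crowd", "is", "awash", "in", "the", "glow", "."]

counts = collections.Counter(toy_wordlist)

# index -> word: the sorted list of distinct words
index_to_word = sorted(counts)
# word -> index: the reverse lookup
word_to_index = {w: i for i, w in enumerate(index_to_word)}

# Encode the text as indices, then decode it back
encoded = [word_to_index[w] for w in toy_wordlist]
decoded = [index_to_word[i] for i in encoded]

print(encoded)
print(decoded == toy_wordlist)
```

The two structures are inverses of each other: one is used to encode the training text, the other to turn a predicted index back into a word.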

In [14]:
word_counts.most_common

<bound method Counter.most_common of Counter({'.': 2806, 'the': 2123, '\n\n\n': 1926, ',': 1520, '\n\n\n\n\n\n': 1515, 'a': 993, 'and': 704, ' ': 700, 'of': 602, 'his': 492, 'he': 475, 'in': 474, '-': 472, 'to': 442, 'is': 440, "'s": 333, 'it': 285, 'on': 278, 'as': 235, 'int': 234, 'at': 199, 'with': 185, 'her': 178, 'out': 163, 'into': 160, 'him': 160, 'joe': 151, 'from': 149, 'up': 147, 'ext': 134, 'day': 132, 'she': 130, '--': 127, 'then': 127, 'max': 127, 'an': 125, 'are': 115, 'back': 111, 'room': 109, 'down': 102, 'man': 102, 'neo': 102, 'that': 100, 'like': 98, 'through': 95, 'looks': 92, 'door': 89, 'we': 89, 'but': 84, '\n\n': 84, 'anderson': 84, 'for': 82, 'they': 81, ':': 76, 'has': 73, 'train': 72, 'night': 70, 'over': 69, 'old': 69, 'kelvin': 65, 'car': 63, 'them': 63, 'caleb': 63, 'by': 61, 'one': 60, 'open': 60, 'off': 60, 'around': 58, 'apartment': 58, 'sees': 57, '(': 57, ')': 57, '\n\n\n\n': 56, 'eyes': 56, 'there': 56, 'face': 54, 'street': 52, 'house': 52, 'their':

In [15]:
word_counts = collections.Counter(wordlist)
word_counts.most_common

<bound method Counter.most_common of Counter({'.': 2806, 'the': 2123, ',': 1520, '\n\n\n\n\n\n': 1515, 'a': 993, 'and': 704, ' ': 700, 'of': 602, 'his': 492, 'he': 475, 'in': 474, '-': 472, 'to': 442, 'is': 440, "'s": 333, 'it': 285, 'on': 278, 'as': 235, 'int': 234, 'at': 199, 'with': 185, 'her': 178, 'out': 163, 'into': 160, 'him': 160, 'joe': 151, 'from': 149, 'up': 147, 'ext': 134, 'day': 132, 'she': 130, '--': 127, 'then': 127, 'max': 127, 'an': 125, 'are': 115, 'back': 111, 'room': 109, 'down': 102, 'man': 102, 'neo': 102, 'that': 100, 'like': 98, 'through': 95, 'looks': 92, 'door': 89, 'we': 89, 'but': 84, '\n\n': 84, 'anderson': 84, 'for': 82, 'they': 81, ':': 76, 'has': 73, 'train': 72, 'night': 70, 'over': 69, 'old': 69, 'kelvin': 65, 'car': 63, 'them': 63, 'caleb': 63, 'by': 61, 'one': 60, 'open': 60, 'off': 60, 'around': 58, 'apartment': 58, 'sees': 57, '(': 57, ')': 57, '\n\n\n\n': 56, 'eyes': 56, 'there': 56, 'face': 54, 'street': 52, 'house': 52, 'their': 51, 'window': 5

In [16]:
# count the number of words
word_counts = collections.Counter(wordlist)

# Mapping from index to word : that's the vocabulary
vocabulary_inv = [x[0] for x in word_counts.most_common()]
vocabulary_inv = list(sorted(vocabulary_inv))

# Mapping from word to index
vocab = {x: i for i, x in enumerate(vocabulary_inv)}
words = [x[0] for x in word_counts.most_common()]

#size of the vocabulary
vocab_size = len(words)
print("vocab size: ", vocab_size)

#save the words and vocabulary
with open(os.path.join(vocab_file), 'wb') as f:
    cPickle.dump((words, vocab, vocabulary_inv), f)

vocab size:  4641


## create sequences
Now, we have to create the input data for our LSTM. We create two lists:
 - **sequences**: this list will contain the sequences of words used to train the model,
 - **next_words**: this list will contain the next word for each sequence of the **sequences** list.
 
In this exercise, we train the network with sequences of 20 words (seq_length = 20).

So, to create the first sequence of words, we take the first 20 words of the **wordlist** list. Word 21 is the next word of this first sequence, and is added to the **next_words** list.

Then we move forward by a step of 1 (sequences_step = 1 in our example) in the list of words, to create the second sequence of words and retrieve the second "next word".

We iterate this task until the end of the list of words.
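
The same sliding window can be followed by hand on a tiny example. The sketch below uses a made-up word list and a window of 3 words instead of 20, so that every sequence can be printed; the `toy_` names are assumptions for the illustration.

```python
toy_wordlist = ["the", "train", "crashes", "into", "the", "wall", "."]
toy_seq_length = 3   # 20 in the notebook
toy_step = 1

toy_sequences = []
toy_next_words = []
for i in range(0, len(toy_wordlist) - toy_seq_length, toy_step):
    # the window of words fed to the network
    toy_sequences.append(toy_wordlist[i: i + toy_seq_length])
    # the word the network should predict for that window
    toy_next_words.append(toy_wordlist[i + toy_seq_length])

for seq, nxt in zip(toy_sequences, toy_next_words):
    print(seq, "->", nxt)
```

With a step of 1, a list of N words yields N - seq_length training pairs.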

In [17]:
#create sequences
sequences = []
next_words = []
for i in range(0, len(wordlist) - seq_length, sequences_step):
    sequences.append(wordlist[i: i + seq_length])
    next_words.append(wordlist[i + seq_length])

print('nb sequences:', len(sequences))

nb sequences: 36005


When we iterate over the whole list of words, we create 36005 sequences of words and retrieve, for each of them, the next word to be predicted.

However, these lists cannot be used "as is". We have to transform them before feeding them to the LSTM. Text will not be understood by a neural net; we have to use numbers.
However, we cannot simply map a word to its index in the vocabulary, as the index does not intrinsically represent the word (it would suggest an arbitrary ordering between words). It is better to encode a sequence of words as a matrix of booleans.

So, we create the matrix X and y :
 - X : the matrix of the following dimensions:
     - number of sequences,
     - number of words in sequences,
     - number of words in the vocabulary.
 - y : the matrix of the following dimensions:
     - number of sequences,
     - number of words in the vocabulary.
 
For each word, we retrieve its index in the vocabulary, and we set to 1 its position in the matrix.
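
A minimal sketch of this encoding on a four-word vocabulary, with invented data, shows what X and y look like. The `toy_` names are assumptions for the illustration.

```python
import numpy as np

toy_vocab = {".": 0, "crashes": 1, "the": 2, "train": 3}
toy_sequences = [["the", "train"], ["train", "crashes"]]
toy_next_words = ["crashes", "."]

n_sequences = len(toy_sequences)
toy_seq_length = len(toy_sequences[0])
toy_vocab_size = len(toy_vocab)

# X: (number of sequences, words per sequence, vocabulary size)
X_toy = np.zeros((n_sequences, toy_seq_length, toy_vocab_size), dtype=bool)
# y: (number of sequences, vocabulary size)
y_toy = np.zeros((n_sequences, toy_vocab_size), dtype=bool)

for i, sentence in enumerate(toy_sequences):
    for t, word in enumerate(sentence):
        X_toy[i, t, toy_vocab[word]] = True
    y_toy[i, toy_vocab[toy_next_words[i]]] = True

print(X_toy.astype(int))
print(y_toy.astype(int))
```

Each word position contains exactly one `True`, at the index of the word in the vocabulary.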

In [18]:
X = np.zeros((len(sequences), seq_length, vocab_size), dtype=bool)

In [19]:
len(sequences), seq_length, vocab_size

(36005, 20, 4641)

In [20]:
44048*10*4647*4/(2**30)

7.625336050987244

In [21]:
X = np.zeros((len(sequences), seq_length, vocab_size), dtype=bool)
y = np.zeros((len(sequences), vocab_size), dtype=bool)
for i, sentence in enumerate(sequences):
    for t, word in enumerate(sentence):
        X[i, t, vocab[word]] = 1
    y[i, vocab[next_words[i]]] = 1

In [22]:
from sys import getsizeof
getsizeof(X) / (2**30)

3.1124653555452824
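
The size measured above can be estimated before allocating anything: a dense one-hot array holds one element per (sequence, position, vocabulary word). The helper below, `one_hot_gib`, is a hypothetical name introduced for this sketch. The earlier cell multiplying four numbers appears to be the same kind of estimate for another configuration with 4-byte elements, although the notebook does not say so.

```python
import numpy as np


def one_hot_gib(n_sequences, seq_length, vocab_size, dtype=bool):
    """Size in GiB of a dense one-hot array of the given shape and dtype."""
    n_bytes = n_sequences * seq_length * vocab_size * np.dtype(dtype).itemsize
    return n_bytes / 2**30


# Shape used in this notebook: (36005, 20, 4641)
bool_gib = one_hot_gib(36005, 20, 4641)                 # 1 byte per element
float_gib = one_hot_gib(36005, 20, 4641, np.float32)    # 4 bytes per element

print("bool:    %.2f GiB" % bool_gib)
print("float32: %.2f GiB" % float_gib)
```

Using a boolean dtype keeps the array four times smaller than a 32-bit float one, which matters because the size grows with the product of all three dimensions.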

# Build Model

Now, here comes the fun part: the creation of the neural network.
As you will see, I am using Keras, which provides a very good abstraction to design an architecture.

In this example, I create the following neural network:
 - bidirectional LSTM,
 - with a size of 256 and using ReLU as activation,
 - then a dropout layer of 0.6 (it's pretty high, but necessary to avoid quick divergence)
 

The net should provide me a probability for each word of the vocabulary to be the next one after a given sentence. So I end it with:

 - a simple dense layer of the size of the vocabulary,
 - a softmax activation.
 
I use Adam as the optimizer, and the loss is the categorical cross-entropy.

Here is the function to build the network:

In [23]:
def bidirectional_lstm_model(seq_length, vocab_size):
    # rnn_size and learning_rate are read from the globals defined in the next cell
    print('Build LSTM model.')
    model = Sequential()
    # bidirectional LSTM over the one-hot encoded word sequences
    model.add(Bidirectional(LSTM(rnn_size, activation="relu"),input_shape=(seq_length, vocab_size)))
    model.add(Dropout(0.6))
    # one output per vocabulary word, turned into probabilities by the softmax
    model.add(Dense(vocab_size))
    model.add(Activation('softmax'))
    
    optimizer = Adam(lr=learning_rate)
    model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=[categorical_accuracy])
    return model

In [24]:
rnn_size = 256 # size of RNN
batch_size = 32 # minibatch size
seq_length = 20 # sequence length
num_epochs = 50 # number of epochs
learning_rate = 0.001 #learning rate
sequences_step = 1 #step to create sequences

In [25]:
md = bidirectional_lstm_model(seq_length, vocab_size)
md.summary()

Build LSTM model.
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bidirectional_1 (Bidirection (None, 512)               10031104  
_________________________________________________________________
dropout_1 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 4641)              2380833   
_________________________________________________________________
activation_1 (Activation)    (None, 4641)              0         
Total params: 12,411,937
Trainable params: 12,411,937
Non-trainable params: 0
_________________________________________________________________


If I print the summary of this model, you can see it has about 12.4 million trainable parameters. It is huge, and the computation will take some time to complete.
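
The figures in the summary can be checked by hand. The sketch below uses the standard parameter counts for an LSTM layer (four gates, each with input weights, recurrent weights and a bias) and for a dense layer. The helper names are introduced for this illustration only.

```python
def lstm_params(input_dim, units):
    # 4 gates, each with (input_dim + units) weights per unit plus one bias per unit
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # one weight per (input, output) pair plus one bias per output
    return input_dim * units + units


vocab_size = 4641
rnn_size = 256

# A bidirectional wrapper holds two independent LSTMs (forward and backward)
bidirectional = 2 * lstm_params(vocab_size, rnn_size)
# The dense layer receives the two concatenated LSTM outputs (2 * 256 = 512 values)
dense = dense_params(2 * rnn_size, vocab_size)

print("bidirectional LSTM:", bidirectional)
print("dense:             ", dense)
print("total:             ", bidirectional + dense)
```

Most of the parameters sit in the LSTM input weights, because each input vector has one entry per vocabulary word.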

## train data

Enough talk, let's train the model now. We shuffle the training set and hold out 1% of it as a validation sample (validation_split=0.01). We simply run:

In [26]:
#fit the model
callbacks=[EarlyStopping(patience=4, monitor='val_loss'),
           ModelCheckpoint(filepath=save_dir + "/" + 'my_model_gen_sentences_lstm.{epoch:02d}-{val_loss:.2f}.hdf5',\
                           monitor='val_loss', verbose=0, mode='auto', period=2)]
history = md.fit(X, y,
                 batch_size=batch_size,
                 shuffle=True,
                 epochs=num_epochs,
                 callbacks=callbacks,
                 validation_split=0.01)

Train on 35644 samples, validate on 361 samples

[Progress-bar output truncated; for each epoch, only the first and last visible running values are shown.]

Epoch 1/50 - loss: 8.4443 ... 8.3758 - categorical_accuracy: 0.0000 ... 0.064
Epoch 2/50 - loss: 5.9305 ... 5.6961 - categorical_accuracy: 0.062 ... 0.113
Epoch 3/50 - loss: 6.0311 ... 5.5809 - categorical_accuracy: 0.062 ... 0.123
Epoch 4/50 - loss: 5.3662 ... 5.1721 - categorical_accuracy: 0.187 ... 0.164
Epoch 5/50 - loss: 4.7688 ... 4.7641 - categorical_accuracy: 0.281 ... 0.230
Epoch 6/50 - loss: 4.9489 ... 4.5691 - categorical_accuracy: 0.218 ... 0.230
Epoch 7/50 - loss: 4.5557 ... 4.5510 - categorical_accuracy: 0.312 ... 0.236
Epoch 8/50 - loss: 3.6878 ... 4.1487 - categorical_accuracy: 0.343 ... 0.263


In [27]:
save_dir

'save'

In [28]:
#save the model
md.save(save_dir + "/" + 'my_model_gen_sentences_lstm.final.hdf5')

# Generate phrase

Great !
We have now trained a model to predict the next word of a given sequence of words. In order to generate text, the task is pretty simple:

 - we define a "seed" sequence of 20 words (20 is the number of words required by the neural net for the sequences),
 - we ask the neural net to predict word number 21,
 - then we update the sequence by shifting the words by a step of 1, adding word number 21 at its end,
 - we ask the neural net to predict word number 22,
 - etc. For as long as we want.
 
Doing this, we generate phrases, word by word.

In [69]:
#load vocabulary
print("loading vocabulary...")
vocab_file = os.path.join(save_dir, "words_vocab.pkl")

with open(os.path.join(save_dir, 'words_vocab.pkl'), 'rb') as f:
        words, vocab, vocabulary_inv = cPickle.load(f)

vocab_size = len(words)

loading vocabulary...


In [70]:
from keras.models import load_model
# load the model
print("loading model...")
model = load_model(save_dir + "/" + 'my_model_gen_sentences_lstm.final.hdf5')

loading model...


To improve the word generation and tune the prediction a bit, we introduce a specific function to pick words.

We will not always take the word with the highest predicted probability (or the generated text will be boring); instead, we would like to introduce some uncertainty and let the solution sometimes pick words with a lower predicted probability.

That is the purpose of the function **sample**, which randomly draws a word from the vocabulary.

The probability for a word to be drawn depends directly on its predicted probability of being the next word. In order to tune this probability, we introduce a "temperature" to smooth or sharpen it.

In [71]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
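
To see what the temperature does, the sketch below applies the same rescaling as `sample` to a small, invented probability vector, without the random draw. The helper name `apply_temperature` is an assumption for the illustration.

```python
import numpy as np


def apply_temperature(preds, temperature):
    # Same rescaling as in sample(): divide the log-probabilities by the
    # temperature, then renormalise so the result sums to 1
    preds = np.asarray(preds, dtype="float64")
    scaled = np.exp(np.log(preds) / temperature)
    return scaled / np.sum(scaled)


probs = [0.6, 0.3, 0.1]   # invented next-word probabilities

sharp = apply_temperature(probs, 0.34)     # the value used for generation in this notebook
unchanged = apply_temperature(probs, 1.0)
flat = apply_temperature(probs, 2.0)

for name, p in (("T=0.34", sharp), ("T=1.0", unchanged), ("T=2.0", flat)):
    print(name, np.round(p, 3))
```

A temperature below 1 concentrates the probability mass on the most likely words (safer, more repetitive text), a temperature of 1 leaves the predictions untouched, and a temperature above 1 flattens them (more varied, more error-prone text).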

In [87]:
#initiate sentences
# seed_sentences = "nolan avance sur le chemin de pierre et grimpe les marches ."
seed_sentences = "it is the end of the world"
generated = ''
sentence = []
for i in range (seq_length):
    sentence.append("a")

seed = seed_sentences.split()

for i in range(len(seed)):
    sentence[seq_length-i-1]=seed[len(seed)-i-1]

generated += ' '.join(sentence)
print('Generating text with the following seed: "' + ' '.join(sentence) + '"')

print ()

Generating text with the following seed: "a a a a a a a a a a a a a it is the end of the world"



In [88]:
words_number = 4000


#generate the text
for i in range(words_number):
    #create the vector
    x = np.zeros((1, seq_length, vocab_size))
    for t, word in enumerate(sentence):
        x[0, t, vocab[word]] = 1.
    #print(x.shape)

    #calculate next word
    preds = model.predict(x, verbose=0)[0]
#     print(preds)
    next_index = sample(preds, 0.34)
#     print(next_index)
    next_word = vocabulary_inv[next_index]
#     print(next_word)

    #add the next word to the text
    generated += " " + next_word
    # shift the sentence by one, and add the next word at its end
    sentence = sentence[1:] + [next_word]

print(generated)



a a a a a a a a a a a a a it is the end of the world , a man down , the eye on the floor .   he 's small man is looks with a glass . 

 int .   kelvin 's room 





 anderson 's room 





 a body and guard is , a woman , runs in a man 's on the image , the front door , and and the man .   a man 's has moment . 

 max 's building - night 





 caleb 's hands , the tall man 's , joel 's phone flies up , and looks at the window .   he is looks up .   he is a few in the and the open only helicopter is computer . 





 the desk of the metal is are .   she 's right first , the monitor .   he is not 





 the body of the gun and is an female .   he sees the monitor .   he he looks at the brain . 





 the big " his head . int .   street ( matrix ) - day 





 the monitor - night 





 the room , the eyes .   it is in a " a - sack on the door .   he gets up . he gets up and sees the head .   a man in a man , on a red , a small dark .   the two door is empty .   a train of in the metal t

 a other officer - high as they window with - not ball


In [89]:
with open("from_several_movies_4000_9.txt", "w") as text_file:
    text_file.write(generated)

In [43]:
generated

"a a a a a a a a a a a a a a a a a it is raining . \n\n\n\n\n\n a sound of a gun - face front of the glass .   the man is gone . \n\n\n\n\n\n neo 's long , it is about to be not , a face , just in a moment , the cover the door and but max 's back and nothing . max 's day \n\n\n\n\n\n the crowd , the crowd is int .   anderson 's room \n\n\n\n\n\n the only move , the phone , he is door .   she is cut to a man 's door is across the man . \n\n\n\n\n\n int .   kelvin 's room \n\n\n\n\n\n a sound of a room . \n\n int .   hovercraft \n\n\n\n\n\n the beat is max 's area - a night - the dark , picks out it is a . and a red time . \n\n int .   street ( matrix ) - day \n\n\n\n\n\n the corn field , a face of the woman , and anderson 's face . \n\n\n\n\n\n int . paris streets - day \n\n\n\n\n\n the car of it 's eyes .   he turns to a other red screams .   the window .   he does n't see a eyes . \n\n\n\n\n\n int .   kelvin 's room \n\n\n\n\n\n a helicopter is , witwer , and it 's eyes .   a monitor 

In [34]:
vocab

{'\n\n': 0,
 '\n\n\n': 1,
 '\n\n\n\t': 2,
 '\n\n\n\t\t\t': 3,
 '\n\n\n\t\t ': 4,
 '\n\n\n\t       ': 5,
 '\n\n\n\n': 6,
 '\n\n\n\n\n': 7,
 '\n\n\n\n\n\t': 8,
 '\n\n\n\n\n\n': 9,
 '\n\n\n\n\n\n\t': 10,
 '\n\n\n\n\n\n\t\t\t\t\t': 11,
 '\n\n\n\n\n\n\t\t\t\t\t\n\n\n\t': 12,
 '\n\n\n\n\n\n\t\t\t\t\t\n\n\n\n\n\n\t': 13,
 '\n\n\n\n\n\n\t\t\t\t\t\n\n\n\n\n\n\n\n\n\t': 14,
 '\n\n\n\n\n\n\t\t\t\n\n\n\n\n\n\t': 15,
 '\n\n\n\n\n\n\t       ': 16,
 '\n\n\n\n\n\n\n\t': 17,
 '\n\n\n\n\n\n\n\n': 18,
 '\n\n\n\n\n\n\n\n\t': 19,
 '\n\n\n\n\n\n\n\n\n': 20,
 '\n\n\n\n\n\n\n\n\n\t': 21,
 '\n\n\n\n\n\n\n\n\n\n\n\n               ': 22,
 '\n\n\n\n\n\n\n\n\n     ': 23,
 '\n\n\n\n\n\n ': 24,
 '\n\n\n\n\n\n \n\n\n          ': 25,
 '\n\n\n\n\n\n    ': 26,
 '\n\n\n\n\n\n     ': 27,
 '\n\n\n\n\n\n      ': 28,
 '\n\n\n\n\n\n          ': 29,
 '\n\n\n\n\n\n               ': 30,
 '\n\n\n\n\n\n                         ': 31,
 '\n\n\n\n\n\n                            ': 32,
 '\n\n\n\n\n\n                                   

In [109]:
vocab

{'\n\t\t': 0,
 '\n\t\t\t': 1,
 '\n\t\t\t\t\t': 2,
 '\n\t\n\t': 3,
 '\n\t\n\t\t\t\t': 4,
 '\n\n\t': 5,
 '\n\n\t\t': 6,
 '\n\n\t\t\t': 7,
 '\n\n\t\t\t\t': 8,
 '\n\n\t\t\t\t\t': 9,
 '\n\n\n\t': 10,
 '\n\n\n\t\t\t\t': 11,
 '\n\n    ': 12,
 '\n \n\t\t\t\t': 13,
 '\n    ': 14,
 '!': 15,
 '"': 16,
 '"crazy': 17,
 '"germs': 18,
 '#': 19,
 '&': 20,
 "'": 21,
 "'92": 22,
 "'cause": 23,
 "'d": 24,
 "'em": 25,
 "'ll": 26,
 "'m": 27,
 "'re": 28,
 "'s": 29,
 "'ve": 30,
 '(': 31,
 ')': 32,
 ',': 33,
 '-': 34,
 '--': 35,
 '-----------------------': 36,
 '--number': 37,
 '--this': 38,
 '.': 39,
 '..': 40,
 '...': 41,
 '....': 42,
 '/': 43,
 '1': 44,
 '1162': 45,
 '12': 46,
 '150': 47,
 '17': 48,
 '1841': 49,
 '18th': 50,
 '1917': 51,
 '1989': 52,
 '1995': 53,
 '2': 54,
 '200': 55,
 '20th': 56,
 '233': 57,
 '25': 58,
 '2nd': 59,
 '3': 60,
 '38': 61,
 '4': 62,
 '46': 63,
 '52': 64,
 '5429': 65,
 '65': 66,
 '66578': 67,
 '7': 68,
 '747': 69,
 '784': 70,
 '8': 71,
 '87645': 72,
 '89': 73,
 '8oo': 74,
 '9':