# Example: building an LSTM (Long Short-Term Memory) neural network for *automatic text generation*

In [1]:
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers import LSTM
from keras.optimizers import RMSprop
import numpy as np
import random
import sys

Using TensorFlow backend.


We start by loading the data (a text file containing a concatenation of several books by Nietzsche).

In [2]:
path = '../data/external/nietzsche.txt'
text = open(path).read().lower()
print('corpus length:', len(text))

corpus length: 600901


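Build the lookup tables for the symbols encountered in the corpus (here mostly letters, digits, and punctuation marks): one maps each character to an index, the other maps each index back to its character.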

In [3]:
chars = sorted(list(set(text)))
print('total chars:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))
print(chars)

total chars: 59
['\n', ' ', '!', '"', "'", '(', ')', ',', '-', '.', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '=', '?', '[', ']', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '¤', '¦', '©', '«', 'ã', '†']
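As a quick sanity check of the two lookup tables, the sketch below rebuilds them on a toy string and verifies that encoding followed by decoding returns the original text. The names `toy_text`, `toy_chars`, `toy_char_indices`, and `toy_indices_char` are made up for this illustration and are not part of the notebook.

```python
# Toy corpus, used only for illustration (not the Nietzsche file).
toy_text = "hello world"

# Same construction as in the cell above: sorted unique characters.
toy_chars = sorted(set(toy_text))
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}
toy_indices_char = {i: c for i, c in enumerate(toy_chars)}

# Encode the text as a list of integer indices, then decode it back.
encoded = [toy_char_indices[c] for c in toy_text]
decoded = "".join(toy_indices_char[i] for i in encoded)

print(toy_chars)
print(encoded)
print(decoded == toy_text)
```

Because the character list is sorted, the index assigned to each character is deterministic for a given corpus, which matters if the tables need to be rebuilt later to decode predictions.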


Cut the text into semi-redundant (overlapping) sequences of length `maxlen`, each paired with the character that immediately follows it.

In [4]:
maxlen = 40
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb sequences:', len(sentences))

nb sequences: 200287
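To make the sliding window concrete, here is a minimal sketch on a ten-character string. The values `toy_text`, `toy_maxlen`, and `toy_step` are hypothetical and chosen only so the windows are easy to read; the loop itself mirrors the cell above.

```python
# Toy values, chosen only to make the windows easy to read.
toy_text = "abcdefghij"
toy_maxlen = 4
toy_step = 3

windows = []
targets = []
for i in range(0, len(toy_text) - toy_maxlen, toy_step):
    windows.append(toy_text[i: i + toy_maxlen])   # input sequence
    targets.append(toy_text[i + toy_maxlen])      # character to predict

for w, t in zip(windows, targets):
    print(repr(w), '->', repr(t))

# Number of windows: ceil((len(text) - maxlen) / step), via integer arithmetic.
expected_count = -(-(len(toy_text) - toy_maxlen) // toy_step)
print(len(windows) == expected_count)
```

Consecutive windows overlap whenever the step is smaller than the window length, which is what makes the sequences "semi-redundant". With the notebook's own values, the same formula gives ceil((600901 - 40) / 3) = 200287, matching the printed count.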


Vectorization (building the arrays on which the model is actually trained): each character is one-hot encoded.

In [5]:
# the builtin bool works across NumPy versions (np.bool was deprecated in NumPy 1.20)
X = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        X[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1
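The same one-hot layout can be inspected on a tiny example. This is only a sketch: the three-letter alphabet and the two toy sequences below are invented for illustration, and every `toy_`/`_toy` name is hypothetical.

```python
import numpy as np

# Hypothetical three-character alphabet and two sequences of length 2.
toy_chars = ['a', 'b', 'c']
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}
toy_sentences = ["ab", "bc"]
toy_next_chars = ["c", "a"]
toy_maxlen = 2

# Axes of X_toy: (sequence, position in sequence, character index).
X_toy = np.zeros((len(toy_sentences), toy_maxlen, len(toy_chars)), dtype=bool)
# Axes of y_toy: (sequence, character index of the next character).
y_toy = np.zeros((len(toy_sentences), len(toy_chars)), dtype=bool)

for i, sentence in enumerate(toy_sentences):
    for t, char in enumerate(sentence):
        X_toy[i, t, toy_char_indices[char]] = 1
    y_toy[i, toy_char_indices[toy_next_chars[i]]] = 1

print(X_toy.shape)
print(X_toy.astype(int))
print(y_toy.astype(int))
```

Each position of each sequence holds exactly one `True` value. On the full corpus the same layout gives `X` the shape (200287, 40, 59), which is why a compact boolean dtype is used.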

Building the LSTM network itself: a single-layer LSTM followed by a softmax output over the characters.

In [6]:
# build the model: a single LSTM
print('Build model...')
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))  # LSTM layer with 128 units
model.add(Dense(len(chars)))
model.add(Activation('softmax'))

optimizer = RMSprop(lr=0.01)  # gradient descent variant; recent Keras releases name this argument learning_rate
model.compile(loss='categorical_crossentropy', optimizer=optimizer)

Build model...
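To get a feel for the size of this network, the number of trainable parameters can be worked out by hand. The sketch below assumes Keras's default LSTM configuration (four gate blocks, each with input weights, recurrent weights, and a bias) and the vocabulary size of 59 reported earlier; the variable names are illustrative only.

```python
units = 128    # LSTM size used in the cell above
n_chars = 59   # vocabulary size reported earlier in the notebook

# Each of the 4 gate blocks has a (n_chars x units) input matrix,
# a (units x units) recurrent matrix, and a bias vector of length units.
lstm_params = 4 * (units * (n_chars + units) + units)

# The Dense layer maps the 128 LSTM outputs to one score per character.
dense_params = units * n_chars + n_chars

total_params = lstm_params + dense_params
print(lstm_params, dense_params, total_params)
```

Under these assumptions the script prints 96256, 7611, and 103867. Calling `model.summary()` in the notebook should report comparable figures if the defaults have not been changed.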


Define a function that picks the next character from the probability distribution produced by the LSTM network. The `temperature` parameter controls how strongly the choice favours the most likely characters.

In [7]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    # the prob array here is a vector with one prob for each character (59 items with Nietzsche)
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
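The effect of `temperature` is easier to see on a small, made-up distribution. The sketch below applies the same log / divide / exponentiate / renormalize steps as `sample`, but without the random draw; `apply_temperature` and `toy_probs` are hypothetical names introduced for this illustration, and only the standard library is used.

```python
import math

def apply_temperature(probs, temperature):
    # Same transformation as in sample(): log, divide by the temperature,
    # exponentiate, then renormalize so the values sum to 1.
    scaled = [math.exp(math.log(p) / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

toy_probs = [0.7, 0.2, 0.1]   # made-up distribution over three characters
for temperature in [0.2, 0.5, 1.0, 1.2]:
    reshaped = apply_temperature(toy_probs, temperature)
    print(temperature, [round(p, 3) for p in reshaped])
```

A temperature below 1 sharpens the distribution, so the most likely character is chosen almost every time. A temperature of 1 leaves it unchanged, and a temperature above 1 flattens it, giving unlikely characters more of a chance. This is consistent with the generated samples becoming more repetitive at low diversity and more erratic at high diversity.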

And finally, the training phase, in which the network is "slid" over the data extracted from the text file.

At the end of each pass over the text, we generate a snippet of text from a seed (see the sample results in the presentation slides).

In [None]:
# train the model, output generated text after each iteration
for iteration in range(1, 30):
    print()
    print('-' * 50)
    print('Iteration', iteration)

    model.fit(X, y, batch_size=128, epochs=1)

    start_index = random.randint(0, len(text) - maxlen - 1)

    for diversity in [0.2, 0.5, 1.0, 1.2]:
        print()
        print('----- diversity:', diversity)

        generated = ''
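        # alternative: take a random seed from the corpus
        #sentence = text[start_index: start_index + maxlen]
        # fixed seed: must be exactly maxlen (40) characters, all present in chars
        sentence = "will cgi get a data science project soon"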
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(400):
            x = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x[0, t, char_indices[char]] = 1.  # one hot encoding

            preds = model.predict(x, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            generated += next_char
            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()


--------------------------------------------------
Iteration 1
Epoch 1/1

----- diversity: 0.2
----- Generating with seed: "will cgi get a data science project soon"
will cgi get a data science project soon the porits and the more and the sense of the personation of the perponsely and the propent the perponselless and such prowled the some and the deceration of the prease of the precession of the inderent and the all the pretent to the perponsely the seed the the pospless of prease of the porital the preceds and in the present and the present the possicion of the prease of the porits of the pores th

----- diversity: 0.5
----- Generating with seed: "will cgi get a data science project soon"
will cgi get a data science project soon sach periodance and the and too stall the incodved the seen the interponical the strengh of cares
in the sense to every deceality of the the personately and the haster to the will think to the perplened to porical and completo the man poresting to treats and

will cgi get a data science project soon the the say that the suppreciation of the contemptions of the satisfier the conscience of the conscience of the supposition of the struggle and the struggle the struggle the stronger the strives the stronger to the stronger the strength to the struggle the conscience of the strength to the satisfaction is the strength to the strength in the struggle the strength the strength is the contemptions a

----- diversity: 0.5
----- Generating with seed: "will cgi get a data science project soon"
will cgi get a data science project soon the man is the short, there is always the sign the stronger as on his strength is to him and destiny and decogres in the still nature of sensives are the contrast the strugged and dangerous good existement of that which the cortser to the spirit, there is not a purity of the comprehending possible be means of the end the stride of interprence the christian nature of the spirits of the mean except

----- diversity: 1.0
---

  """


 fururblat painful ryengare, remain appears, what without has leave by that fear-wominatie, thir also addics without in whom thingsered, is we eghe
nair tened in the evil to theabar ly places  its proval inflicts it is
envoicance for one domain my preferably unconclusion" in their
propity, to your enemously to selfence from force for
there is also hamres three, as again indives ask wit

----- diversity: 1.2
----- Generating with seed: "will cgi get a data science project soon"
will cgi get a data science project soon. among pals moral "christian go
stancing preferatively
grown as it lapadiydles one element kinale but uursen of "soul, no hads" from lifeful spirital,y to
theirvoust phtied agay might the
virtoned and asseath is understandsjosing, it is everylicoic ow-sequal: as a man as
fail that it is as lieed, to preasents--to his beiner,  development, now sense, nationarily fin is enough woman, whereen signiq

--------------------------------------------------
Iteration 11
Epoch 1/1

-