# Example: building an LSTM (Long Short-Term Memory) neural network for *automatic text generation*

In [1]:
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers import LSTM
from keras.optimizers import RMSprop
import numpy as np
import random
import sys

Using TensorFlow backend.


We start by loading the data (a text file containing a concatenation of several books by Nietzsche).

In [2]:
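path = '../data/nietzsche.txt'
# assumption: the corpus file is UTF-8 encoded; the context manager closes the file after reading
with open(path, encoding='utf-8') as f:
    text = f.read().lower()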
print('corpus length:', len(text))

corpus length: 600893


Build the dictionary of symbols encountered in the text (here mainly letters, digits and punctuation), with a lookup table in each direction: character to index and index to character.

In [3]:
chars = sorted(list(set(text)))
print('total chars:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))
print(chars)

total chars: 57
['\n', ' ', '!', '"', "'", '(', ')', ',', '-', '.', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '=', '?', '[', ']', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'ä', 'æ', 'é', 'ë']
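To make the two lookup tables concrete, here is a minimal sketch on a toy string. The names `toy_text`, `toy_chars`, `toy_char_indices` and `toy_indices_char` are placeholders introduced for illustration and are not part of the notebook; the construction mirrors the cell above.

```python
# Toy corpus standing in for the Nietzsche text (illustration only).
toy_text = "hello world"

# Same construction as above: sorted unique characters, then two lookup tables.
toy_chars = sorted(set(toy_text))
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}
toy_indices_char = {i: c for i, c in enumerate(toy_chars)}

# Encode the text as integer indices, then decode it back.
encoded = [toy_char_indices[c] for c in toy_text]
decoded = "".join(toy_indices_char[i] for i in encoded)

print(toy_chars)   # [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']
print(encoded)     # [3, 2, 4, 4, 5, 0, 7, 5, 6, 4, 1]
print(decoded)     # hello world
```

Sorting the character set keeps the index assignment deterministic from one run to the next, which matters if a trained model is to be reused later.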


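Split the text into semi-redundant sequences of length `maxlen`: a window of `maxlen` characters is moved forward by `step` characters at a time, and the character that immediately follows each window is stored as the target to predict.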

In [4]:
maxlen = 40
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb sequences:', len(sentences))

nb sequences: 200285
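The windowing can be checked on a short string. The sketch below wraps the same slicing in a helper, `make_windows`, which is a hypothetical name used only for this example; it also recomputes the number of sequences from the corpus length printed earlier.

```python
def make_windows(text, maxlen, step):
    """Return (windows, next_chars) using the same slicing as the cell above."""
    windows, next_chars = [], []
    for i in range(0, len(text) - maxlen, step):
        windows.append(text[i:i + maxlen])   # input window of maxlen characters
        next_chars.append(text[i + maxlen])  # character to predict
    return windows, next_chars

toy_windows, toy_next = make_windows("abcdefghij", maxlen=4, step=3)
for window, target in zip(toy_windows, toy_next):
    print(repr(window), "->", repr(target))
# 'abcd' -> 'e'
# 'defg' -> 'h'

# The number of sequences is simply the length of the range being iterated.
n_sequences = len(range(0, 600893 - 40, 3))
print(n_sequences)  # 200285
```

With `step = 3` and `maxlen = 40`, consecutive windows share 37 characters, which is what "semi-redundant" refers to.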


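Vectorization: build the arrays on which the model is actually trained. Each character is one-hot encoded, so `X` has shape (number of sequences, `maxlen`, number of characters) and `y` has shape (number of sequences, number of characters).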

In [5]:
# use the builtin bool: the np.bool alias is deprecated and removed in recent NumPy releases
X = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        X[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1
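A scaled-down version of the same one-hot encoding makes the array layout easier to see. Everything prefixed with `toy_` or suffixed with `_toy` is made up for this sketch; only the loop structure is taken from the cell above.

```python
import numpy as np

# Tiny vocabulary and two windows of length 2 (illustration only).
toy_chars = ['a', 'b', 'c']
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}
toy_sentences = ["ab", "bc"]
toy_next_chars = ["c", "a"]
toy_maxlen = 2

X_toy = np.zeros((len(toy_sentences), toy_maxlen, len(toy_chars)), dtype=bool)
y_toy = np.zeros((len(toy_sentences), len(toy_chars)), dtype=bool)
for i, sentence in enumerate(toy_sentences):
    for t, char in enumerate(sentence):
        X_toy[i, t, toy_char_indices[char]] = True    # one active entry per time step
    y_toy[i, toy_char_indices[toy_next_chars[i]]] = True  # one active entry per target

print(X_toy.shape, y_toy.shape)  # (2, 2, 3) (2, 3)
print(X_toy.astype(int))
print(y_toy.astype(int))
```

On the real corpus the same layout gives an `X` of shape (200285, 40, 57), roughly 457 MB at one byte per boolean, which is one reason to store it as `bool` rather than as floats.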

Building the LSTM network itself: a single LSTM layer followed by a dense softmax layer that outputs one probability per character.

In [6]:
# build the model: a single LSTM
print('Build model...')
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars)))) # LSTM layer with 128 units
model.add(Dense(len(chars)))
model.add(Activation('softmax'))

optimizer = RMSprop(lr=0.01) # gradient descent variant used for training
model.compile(loss='categorical_crossentropy', optimizer=optimizer)

Build model...
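As a rough sanity check on the size of this network, the number of trainable parameters can be estimated by hand. The sketch below assumes the standard LSTM formulation with four gates and bias terms; the helper names are hypothetical, and the figures should be compared against what `model.summary()` reports rather than taken as authoritative.

```python
def lstm_param_count(units, input_dim):
    # Four gates (input, forget, cell candidate, output), each with an input
    # kernel, a recurrent kernel and a bias vector.
    return 4 * (units * input_dim + units * units + units)

def dense_param_count(inputs, outputs):
    # One weight per (input, output) pair plus one bias per output.
    return inputs * outputs + outputs

units, vocab = 128, 57  # values used in this notebook
lstm_params = lstm_param_count(units, vocab)
dense_params = dense_param_count(units, vocab)
print(lstm_params, dense_params, lstm_params + dense_params)
# 95232 7353 102585
```

Note that `maxlen` does not appear in the count: the same weights are reused at each of the 40 time steps.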


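Define a function that selects the next character from the probability distribution produced by the LSTM network. The `temperature` parameter rescales the distribution before sampling: low values favor the most likely characters, high values make the choice more random.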

In [7]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    # the prob array here is a vector with one probability per character (57 items for the Nietzsche corpus)
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
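To see what the temperature does before the random draw, the rescaling step can be isolated. The function `apply_temperature` and the distribution `toy_probs` below are introduced for illustration; the arithmetic is the same as in `sample`, and is equivalent to raising each probability to the power `1 / temperature` and renormalizing.

```python
import numpy as np

def apply_temperature(preds, temperature=1.0):
    """Rescale a probability vector the same way sample() does, without sampling."""
    preds = np.asarray(preds, dtype='float64')
    logits = np.log(preds) / temperature
    exp_logits = np.exp(logits)
    return exp_logits / np.sum(exp_logits)

toy_probs = [0.7, 0.2, 0.1]  # made-up distribution over three characters
for temperature in [0.2, 0.5, 1.0, 1.2]:
    print(temperature, np.round(apply_temperature(toy_probs, temperature), 3))
```

A temperature of 1.0 leaves the distribution unchanged, values below 1.0 concentrate the mass on the most probable character, and values above 1.0 flatten it. One caveat of this formulation is that a probability of exactly zero makes `np.log` return `-inf` and emit a warning.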

Finally, the training phase, in which the network is "slid" over the data extracted from the text file.

At the end of each iteration (three epochs over the data here), a piece of text is generated from a seed, once for each diversity value (see sample results in the presentation slides).
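The generation step relies on a rolling window: the model only ever sees the last `maxlen` characters, and each newly sampled character pushes the oldest one out. The sketch below isolates that mechanism; `generate` and `fake_next_char` are hypothetical names, and `fake_next_char` is a deterministic stand-in for the combination of `model.predict` and `sample`.

```python
def generate(seed, next_char_fn, n_chars):
    """Roll a fixed-length window over the text being generated."""
    window = seed
    generated = seed
    for _ in range(n_chars):
        c = next_char_fn(window)
        generated += c
        window = window[1:] + c  # drop the oldest character, append the new one
    return generated

# Stand-in predictor (assumption for illustration): echo the window's first character.
def fake_next_char(window):
    return window[0]

result = generate("abc", fake_next_char, 5)
print(result)  # abcabcab
```

The window length never changes, which is why the seed must contain exactly `maxlen` characters, all of them present in `char_indices`.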

In [8]:
# train the model, output generated text after each iteration
for iteration in range(1, 30):
    print()
    print('-' * 50)
    print('Iteration', iteration)

    model.fit(X, y, batch_size=128, epochs=3)

    start_index = random.randint(0, len(text) - maxlen - 1)

    for diversity in [0.2, 0.5, 1.0, 1.2]:
        print()
        print('----- diversity:', diversity)

        generated = ''
        #sentence = text[start_index: start_index + maxlen]
        sentence = "will cgi get a data science project soon"
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(400):
            x = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x[0, t, char_indices[char]] = 1.  # one hot encoding

            preds = model.predict(x, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            generated += next_char
            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()


--------------------------------------------------
Iteration 1
Epoch 1/3
Epoch 2/3
Epoch 3/3

----- diversity: 0.2
----- Generating with seed: "will cgi get a data science project soon"
will cgi get a data science project soon is the present the strength of the strange is the some of the strange the such a conscious of the strange of the some of the some of the some some of the some the some the some of good is the some the strength and is the such a some of the strength of the strength is the strength of the strength of the some of the belief and all the some the sension of the solent of the strange the soleng and whi

----- diversity: 0.5
----- Generating with seed: "will cgi get a data science project soon"
will cgi get a data science project soon and so of strange the constantly the far that the superstand and all are as it is his oppers and and of the constantly so as the conception and itself be the strength
of the some of the personal of the sensife itself in the conscience of 

will cgi get a data science project soon that for the strange a strength to be states of a sense and something and satisfie the portrant to be morals
or extent of his moral does not stand standard of new any taste and life, and as to say, and strength in the sight of grant and the strive of the sense, the strong the soul, as the new aristocret of every forms of and inderens for the stacds to profound the
sense of restide partiching of t

----- diversity: 1.0
----- Generating with seed: "will cgi get a data science project soon"
will cgi get a data science project soong for
manifest and guilt that isen",
they have as organs of and joyselfy wang for he chang,
septime one one paro, and inspire and hundews things of priberves. but
dure so freemnrounh, respectly doubt is the corrued and good. as sometiming of as the their cludity successituiness of the s
uteric himself precisely, they
lay,
the seems and
purpime an un"crummen love," many limility what allowess. but 

----- diversity: 1.2
---

  """


al.

12eedt. was not a light in ustiffulle explorable its as for the nephing. but could down upon
the desire upon suncu. laking and dag;bying. stroing?
 fon ourishlesing busnilad.--the on wisesess

--------------------------------------------------
Iteration 5
Epoch 1/3
Epoch 2/3
Epoch 3/3

----- diversity: 0.2
----- Generating with seed: "will cgi get a data science project soon"
will cgi get a data science project soon the spirit in the their and and the self-despise and and own the procence and all the saint the subjection of the interpretation of the sense of the spiritual of the sense of the sense of the spiritual the subjection of the the subjection of the old the subjection of the self-experiences of self--and and as a spirit in the subjection of the partic of the morality of the spiritual and the power an

----- diversity: 0.5
----- Generating with seed: "will cgi get a data science project soon"
will cgi get a data science project soons is preaces, the more and and the the pro

will cgi get a data science project soones and and and and and at the self-desire and the standards and more and attempt of the self-depth and all the self-desire the and long and the self-satisfort and constant the self-desire the conditions and still and and the conditional the self-instinct of the soul is the standards to the conditional and and wholly and a state the self-master and origin and a standard of the self-some and protect

----- diversity: 0.5
----- Generating with seed: "will cgi get a data science project soon"
will cgi get a data science project soonity of a the real fundamental or in the self-individual them. the self-devil is a self-present and afternes and protecting of the most in a something thems that the subjecting is the interestance of the nature and mankind or the good of the intermion of the art of nature at self-self-individual historic (to be consideration of the more of the high almost in the sentiment and morality of his deal t

----- diversity: 1.0
---

of the eneel[éëéééééééé
téém thééééé éäéééééé éw a steper anmo té sééf ved, aä, a dené éébéééé
pééé éféé éé téédéé thééé ééd a stimile or t

----- diversity: 1.0
----- Generating with seed: "will cgi get a data science project soon"
will cgi get a data science project soonééëéééé éééééé ébéééééébæéäéäér séäé as cin thimpt ngre vmating intalrtot théééééé æéééé äéééé éæémégsæéédé ëd théw his
cear not wh onn , é
himpearéméëéé éééé,ééééééérééééé
paévelééæäb-eremo ecphish to ece to9ediéé no todédéééééééé éééæééæé,éhéwéég whoseéééés of the ertiss the in qéméyéééééve éstééæbééééæéféééäééééyé, wasécéps, usurh usons the he
wersfréé.ééééémééé ééëéæäéééæé
éééé ééëéé
of the en

----- diversity: 1.2
----- Generating with seed: "will cgi get a data science project soon"
will cgi get a data science project soonééæéééé,æëëééé
é æëéééé
ææééäé icëéæapons no sro ulitee heer rutongerëéëé
inæééfeéëééééé éæéës éëirééæäé wædd, thsturuin hy nol a photons néimiééé,éééééé
ééé ééééééctéæbééé,é-sellité
for inreiv

will cgi get a data science project soono o o a  (s aereait  a i it oeot o eoeon i iu air t eeterhe ole e  oo sere   hi as in -is  e nrxt  a tt ire t ts inye erea   ae .eh(re st  he ondeeite t ee hee  e oue ro iie  eer (ao !hes eas eare   éis elen lr i  o te n ae  ar' in a tatoe   an  onat   e tte     soo a n her  svo ore  es  t ieste terat o   iirf  i) ane   on t hestithesat t  s   éte t  pe oner te fot e eris t s a oe   e neren  oe a 

--------------------------------------------------
Iteration 15
Epoch 1/3
Epoch 2/3
Epoch 3/3

----- diversity: 0.2
----- Generating with seed: "will cgi get a data science project soon"
will cgi get a data science project soone te   t  ot hooe eerrere ae ie ate atat an o oe  an aoe   ths ast  hot os etie (rtes ae be t o ete are  athe  ooe a we ith  at  e ot c- oe t oo i in  be tzhen a  att nn alethjthes it ne  ae a  th oonee atert oeere t are  th the  o atee oe earea io ekte  ie an ter   to t   itde anr  e ae aon to  tetwe ane  ot tve atere are it  al

will cgi get a data science project soon  ifs a tritite inre ter e o asket  onine tente tee ten al eat tin  an ase an snew eerat as   he  ot ati the a  etqe 
att i  as  nit ese tetate e i  t it e ait itt e et iso  erleswes tirtele ton ant t  ase ie e er iuhinéns aht itoerer at ase ter o  o alit  an  ie t hesis it in ae eeree erlen ass re t rn ah its e ]a t is  on anwree  bes an aene e ao al is a itino anstn  at athe ar qsje at ist iteee

----- diversity: 0.5
----- Generating with seed: "will cgi get a data science project soon"
will cgi get a data science project soon rint as inere at is tlene re e esf  h eheral a oe altenererin is l! toi a artfe es as sxeisreisn or es e tit ees aosen teotel otneises ri is an itl ae ith o toe th  ibo e  noile en  ore toniw se ouit nourinithee a ret eani seurnst sithe ise haininiti isesne  oter ionionr inastere turiohto e s e ers orit ea aneasp tini ire ot  rrtasf o itin hisas eusta  e inor isrerareneee ane iseto inta or ean ae

----- diversity: 1.0
---

will cgi get a data science project soon ere t ins e e re therees  leres   on untr an t athh e he theon thrrlne ther  slas this  ots t athora iree as ir tho  esr  t t  e orer  ie or one art tuee are  l rs  ; iel test ae thee er   i:tit oin o r thirh o thsn eris or nse thannt ese ste   eun ter s tone a s these  the  thun to rs thenetan  te trners enthe  he neetorr t t t ethernrsel he ther s as setin athe onrun roe  s oe thete entou ih ie

----- diversity: 1.0
----- Generating with seed: "will cgi get a data science project soon"
will cgi get a data science project soonaoeso i!in o arr n os ats tet eralineours rsithntathtaea iosu eset    tteethe enerc ate rn oen en ttean  erhor soeooiletne ur a
ol s e n tosi i   oharte ot tu snse trenntae rtsaas  on  ureei azels leru oe on ause thrrnerthreth outu ettion  thn erouilnteelenits r eeietoo lorhins tee sseiatrlers ertinlslerannt   atle t so lu sse  ouseelo rs thoninoithtutesteolirnasitislo ans   is lei thel nurrot sor

----- diversity: 1.2
---

will cgi get a data science project soon te onetion  ee insot oea atn sint siionl anstht so asart n irutisuntoete slt u ansnseilotiot ie  aalo aleithittiliuthtitr ist anttai slie t sonatlatas olsot oriellt th oistaa totnens nlu s tss int hoeteiil t l toerinlettotn inrite lo tae ossyante titheert ann hee iiontht tote sit sealsit terieiseiu ineoathe tiiooaoo  litoteitethatoooatenaae onnlaisseorlitat es its thesstaisasntoinlsess lueh r tli

----- diversity: 1.2
----- Generating with seed: "will cgi get a data science project soon"
will cgi get a data science project soon tst hes toraluluiit liest intinr otlth iln otsrl otl ittoun h alauti that l estelntl nosiisr olis hhe sul  rontlloi hrote e  i eai rt    thoooss tolthlsiri aiilitnntttith  tie tho ine intuelt rs terst ototte onnuatelisus tontethuetoslnt s sol ils thitsreli osnaltis  a hentettiu oaltaith iosnnsiosioaneerettieiol tian auotinls tiet eat it tiroes  stithirathasts ateli unneosssentu tir  iou ulienthns

------------------------

will cgi get a data science project soonrseelon saterietntu thouunstaitit in en ust es esnieinn telr snaheesue anat rlan aes   iies tnset einaties uleailtlil tenaante tsre toshaltestasstsates   ns untholasouorianeut i rrot ene an urs rnn i  tar usss aste tronthutinlitetiiaeai hielui tnn ll  hue ineon an s srs rs t eh heere aseeun isete nanert toh  rheeatissittis thiutssurnanaan ae e tr anos aeothnriethliso ins usnatuttu uuose thsiie tit

--------------------------------------------------
Iteration 28
Epoch 1/3
Epoch 2/3
Epoch 3/3

----- diversity: 0.2
----- Generating with seed: "will cgi get a data science project soon"
will cgi get a data science project soon th an  an  an ie te at the th s thet th  t thene ten in  se t thert thina n  in an ie t t thn  n th t  an  n te aan t the tho aeren  anthi i iee e the en thent  the t thern ion t th ient t thena  ense th the an an ro ten  it  in  an  theit hher a the t th th thietes then t th   in a n  in  anea oe the t thehe te t th thnale th t