# Character-Level RNN Using LSTM Cells

- Trains on Star Trek episode titles
- Outputs "fake" titles.

Much of the code comes from a [Keras example](https://github.com/fchollet/keras/blob/master/examples/lstm_text_generation.py).

## Setup Environment

- Import Keras
- Open up the Star Trek corpus
- Give each letter an index and create dictionaries to translate between characters and indices.
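
As a minimal sketch of the last step, the two lookup dictionaries can be built and round-tripped on a toy string. The two-title `toy_text` is an assumption standing in for `startrekepisodes.txt`; the dictionary names match the ones used in the cell that follows.

```python
# Toy corpus standing in for startrekepisodes.txt (assumption, for illustration only).
toy_text = "the cage\nthe menagerie".lower()

# One index per distinct character, in sorted order.
chars = sorted(set(toy_text))
char_indices = {c: i for i, c in enumerate(chars)}   # character -> index
indices_char = {i: c for i, c in enumerate(chars)}   # index -> character

# Encode a title as indices, then decode it back.
encoded = [char_indices[c] for c in "the cage"]
decoded = "".join(indices_char[i] for i in encoded)
print(len(chars), encoded, decoded)
```

The newline character gets its own index, which is what lets the model learn where one title ends and the next begins.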

In [1]:
## Much borrowed from https://github.com/fchollet/keras/blob/master/examples/lstm_text_generation.py

from __future__ import print_function
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers import LSTM, Dropout
from keras.layers.embeddings import Embedding
from keras.optimizers import RMSprop
from keras.utils.data_utils import get_file
from keras.models import load_model
import numpy as np
import random
import sys

text = open("startrekepisodes.txt").read().lower()
print('corpus length:', len(text))

chars = sorted(list(set(text)))
vocabulary_size = len(chars)
print('total chars:', vocabulary_size)
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

Using TensorFlow backend.


corpus length: 12106
total chars: 55
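
The `sample` helper above divides the log-probabilities by a temperature before renormalizing. A small sketch of just that rescaling step, in plain Python, shows the effect: temperatures below 1 sharpen the distribution and temperatures above 1 flatten it. The name `apply_temperature` and the three-element `probs` list are assumptions made for illustration.

```python
import math

def apply_temperature(probs, temperature=1.0):
    """Rescale a probability list the way `sample` does before drawing from it."""
    scaled = [math.log(p) / temperature for p in probs]
    exps = [math.exp(s) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

probs = [0.5, 0.3, 0.2]                 # made-up next-character distribution
cold = apply_temperature(probs, 0.5)    # sharper: favors the most likely character
same = apply_temperature(probs, 1.0)    # unchanged
hot = apply_temperature(probs, 2.0)     # flatter: closer to uniform
print(cold, same, hot)
```

The generation cell later in the notebook calls `sample(preds, 0.5)`, which corresponds to the "cold" case here.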


In [2]:
# How long is a title?
titles = text.split('\n')
lengths = np.array([len(n) for n in titles])
print(np.max(lengths))
print(np.mean(lengths))
print(np.median(lengths))
print(np.min(lengths))

# Hence choose 30 as the sequence length to train on (about twice the mean title length).

51
15.4497282609
15.0
3


## Setup Training Data

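- Cut up the corpus into overlapping sequences of 30 characters, taken every 3 characters.
- Keep the inputs as integer indices (the Embedding layer consumes them directly) and change the target characters into "one-hot" vector encodings.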
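
The windowing can be sketched on a toy string with a shorter window. The toy text and `maxlen = 5` are assumptions for illustration; the last line reuses the corpus length, window length, and step from this notebook to show where the sequence count comes from.

```python
text = "the cage\nthe menagerie"   # toy corpus (assumption)
maxlen, step = 5, 3

# Each window of `maxlen` characters is paired with the character that follows it.
sentences, next_chars = [], []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i:i + maxlen])
    next_chars.append(text[i + maxlen])

# One-hot encode the targets only; inputs stay as index sequences.
chars = sorted(set(text))
char_indices = {c: i for i, c in enumerate(chars)}
X = [[char_indices[c] for c in s] for s in sentences]
y = [[1 if j == char_indices[c] else 0 for j in range(len(chars))] for c in next_chars]
print(len(sentences), repr(sentences[0]), repr(next_chars[0]))

# Same arithmetic with this notebook's values: corpus length 12106, maxlen 30, step 3.
n_windows = len(range(0, 12106 - 30, 3))
print(n_windows)
```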

In [3]:
# cut the text in semi-redundant sequences of maxlen characters
maxlen = 30
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb sequences:', len(sentences))

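X = np.zeros((len(sentences), maxlen), dtype=int)
# Use the builtin bool: the np.bool alias is deprecated and absent from recent NumPy releases.
y = np.zeros((len(sentences), vocabulary_size), dtype=bool)

for i in range(len(sentences)):
    # A list comprehension works on Python 2 and 3; on Python 3, map() returns a
    # lazy iterator that np.array() would not expand into a row of indices.
    X[i] = [char_indices[c] for c in sentences[i]]
    y[i, char_indices[next_chars[i]]] = True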
print("Done converting y to one-hot.")
print("Done preparing training corpus, shapes of sets are:")
print("X shape: " + str(X.shape))
print("y shape: " + str(y.shape))
print("Vocabulary of characters:", vocabulary_size)

nb sequences: 4026
Done converting y to one-hot.
Done preparing training corpus, shapes of sets are:
X shape: (4026, 30)
y shape: (4026, 55)
Vocabulary of characters: 55


## Model

- Model has two stacked hidden layers of 256 LSTM cells each, with dropout (rate 0.5) applied to the input of each LSTM layer.
- Input layer is an Embedding to convert from indices to a vector encoding automatically (common trick - but does it work?)
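
One way to see what the Embedding layer does: looking up row `i` of its weight matrix gives the same vector as multiplying a one-hot encoding of `i` by that matrix. The sketch below checks this in plain Python; the tiny matrix `E` and both helper functions are made up for illustration and are not Keras internals.

```python
vocabulary_size, embedding_dim = 4, 3

# Made-up embedding weights: one row per vocabulary index.
E = [[0.1, 0.2, 0.3],
     [0.4, 0.5, 0.6],
     [0.7, 0.8, 0.9],
     [1.0, 1.1, 1.2]]

def embed_lookup(index):
    """Embedding as a table lookup."""
    return E[index]

def embed_one_hot(index):
    """Embedding as (one-hot vector) x (weight matrix)."""
    one_hot = [1.0 if j == index else 0.0 for j in range(vocabulary_size)]
    return [sum(one_hot[r] * E[r][c] for r in range(vocabulary_size))
            for c in range(embedding_dim)]

for i in range(vocabulary_size):
    print(i, embed_lookup(i), embed_one_hot(i))
```

So feeding indices into an Embedding is equivalent to feeding one-hot vectors into a bias-free linear layer, without materializing the one-hot inputs.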

In [7]:
layer_size = 256
dropout_rate = 0.5
# build the model: an embedding followed by two stacked LSTM layers
print('Build model...')
model_train = Sequential()
model_train.add(Embedding(vocabulary_size, layer_size, input_length=maxlen))

# LSTM part
model_train.add(Dropout(dropout_rate))
model_train.add(LSTM(layer_size, return_sequences=True))
model_train.add(Dropout(dropout_rate))
model_train.add(LSTM(layer_size))

# Project back to vocabulary
model_train.add(Dense(vocabulary_size))
model_train.add(Activation('softmax'))
model_train.compile(loss='categorical_crossentropy', optimizer=RMSprop(lr=0.01))
model_train.summary()



Build model...
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 30, 256)           14080     
_________________________________________________________________
dropout_3 (Dropout)          (None, 30, 256)           0         
_________________________________________________________________
lstm_3 (LSTM)                (None, 30, 256)           525312    
_________________________________________________________________
dropout_4 (Dropout)          (None, 30, 256)           0         
_________________________________________________________________
lstm_4 (LSTM)                (None, 256)               525312    
_________________________________________________________________
dense_2 (Dense)              (None, 55)                14135     
_________________________________________________________________
activation_2 (Activation)    (None, 55)                0     
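
The parameter counts in the summary can be reproduced by hand. The sketch below uses the usual formulas for an embedding table, an LSTM layer with four gates (input kernel, recurrent kernel, and bias per gate), and a dense layer; the helper names are hypothetical, and the sizes come from this notebook (`vocabulary_size = 55`, `layer_size = 256`).

```python
def embedding_params(vocab, dim):
    # One dim-sized vector per vocabulary entry.
    return vocab * dim

def lstm_params(input_dim, units):
    # Four gates, each with an input kernel, a recurrent kernel, and a bias.
    return 4 * (input_dim * units + units * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus bias.
    return input_dim * units + units

vocabulary_size, layer_size = 55, 256
counts = [
    embedding_params(vocabulary_size, layer_size),  # embedding
    lstm_params(layer_size, layer_size),            # first LSTM
    lstm_params(layer_size, layer_size),            # second LSTM
    dense_params(layer_size, vocabulary_size),      # dense projection
]
print(counts, sum(counts))
```

Dropout and Activation layers have no weights, which is why they show 0 in the summary.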

## Training

- Train on batches of 64 examples, first for 30 epochs and then for 20 more.
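
As a quick arithmetic check, the number of weight updates per epoch follows from the sequence count reported earlier (4026) and the `batch_size=64` passed to `fit` in the next cell. The variable names here are illustrative only.

```python
num_sequences = 4026   # "nb sequences" printed when the training data was built
batch_size = 64        # value passed to model_train.fit

# Ceiling division without floats, so it behaves the same on Python 2 and 3.
steps_per_epoch = -(-num_sequences // batch_size)
last_batch_size = num_sequences - (steps_per_epoch - 1) * batch_size
print(steps_per_epoch, last_batch_size)
```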

In [8]:
# Training the Model.
model_train.fit(X, y, batch_size=64, epochs=30)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<keras.callbacks.History at 0x11d249c10>

In [16]:
model_train.fit(X, y, batch_size=64, epochs=20)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x124bdca10>

In [17]:
# Save model if necessary
model_train.save("keras-startrek-LSTM-model.h5")

## Test the Model

- Take a seed, then generate new characters one at a time (400 by default).

### Make a Decoder model

- Needs an input length of 1.
- Needs a batch size of 1.
- Needs the LSTM layers to be stateful.
- Check that the parameter counts match those of model_train, so the trained weights can be copied over.
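
Why statefulness matters can be sketched with a toy recurrence. The class below is a scalar tanh RNN, a deliberate simplification and not an LSTM or any Keras class; its weights and inputs are made up. It shows that feeding one step at a time while carrying the hidden state forward gives the same result as processing the whole sequence at once, whereas resetting the state before every step does not.

```python
import math

class TinyStatefulRNN:
    """Scalar tanh RNN that keeps its hidden state between calls (toy stand-in)."""

    def __init__(self, w_in=0.5, w_rec=0.8):
        self.w_in, self.w_rec = w_in, w_rec
        self.h = 0.0

    def reset_states(self):
        self.h = 0.0

    def step(self, x):
        # New state depends on the current input and the carried-over state.
        self.h = math.tanh(self.w_in * x + self.w_rec * self.h)
        return self.h

def run_full_sequence(xs, w_in=0.5, w_rec=0.8):
    """Process the whole sequence in one call, starting from a zero state."""
    h = 0.0
    for x in xs:
        h = math.tanh(w_in * x + w_rec * h)
    return h

xs = [1.0, -0.5, 0.25]

# Stateful: one input per call, state carried across calls.
rnn = TinyStatefulRNN()
for x in xs:
    stateful_out = rnn.step(x)

# Stateless: the state is cleared before every call, so context is lost.
rnn.reset_states()
for x in xs:
    rnn.reset_states()
    stateless_out = rnn.step(x)

print(stateful_out, run_full_sequence(xs), stateless_out)
```

The decoder model built next relies on the same idea: with an input length of 1, the LSTM layers must be stateful for earlier characters to influence the next prediction.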

In [32]:
# Load model if necessary.
model_train = load_model("keras-startrek-LSTM-model.h5")

In [14]:
# Build a decoding model (input length 1, batch size 1, stateful)
layer_size = 256
dropout_rate = 0.5

model_dec = Sequential()
model_dec.add(Embedding(vocabulary_size, layer_size, input_length=1, batch_input_shape=(1,1)))

# LSTM part
model_dec.add(Dropout(dropout_rate))
model_dec.add(LSTM(layer_size, stateful=True, return_sequences=True))
model_dec.add(Dropout(dropout_rate))
model_dec.add(LSTM(layer_size, stateful=True))

# project back to vocabulary
model_dec.add(Dense(vocabulary_size))
model_dec.add(Activation('softmax'))
model_dec.compile(loss='categorical_crossentropy', optimizer=RMSprop(lr=0.01))
model_dec.summary()

# set weights from training model
model_dec.set_weights(model_train.get_weights())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (1, 1, 256)               14080     
_________________________________________________________________
dropout_7 (Dropout)          (1, 1, 256)               0         
_________________________________________________________________
lstm_7 (LSTM)                (1, 1, 256)               525312    
_________________________________________________________________
dropout_8 (Dropout)          (1, 1, 256)               0         
_________________________________________________________________
lstm_8 (LSTM)                (1, 256)                  525312    
_________________________________________________________________
dense_4 (Dense)              (1, 55)                   14135     
_________________________________________________________________
activation_4 (Activation)    (1, 55)                   0         
Total para

In [11]:
## Sampling function

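def sample_model(seed, model_name, length=400):
    '''Samples a charRNN given a non-empty seed sequence.'''
    generated = ''
    sentence = seed.lower()[:]
    generated += sentence
    print("Seed: ", generated)

    # Start from a clean state, then feed the seed one character at a time so the
    # stateful LSTMs carry the context forward. The model accepts input of shape
    # (1, 1) only, so a longer seed cannot be passed in a single call.
    model_name.reset_states()
    for c in sentence:
        x = np.array([[char_indices[c]]])
        preds = model_name.predict(x, verbose=0)[0]

    for i in range(length):
        next_index = sample(preds, 0.5)
        next_char = indices_char[next_index]
        generated += next_char

        # Feed the sampled character back in to get the next distribution.
        x = np.array([[char_indices[next_char]]])
        preds = model_name.predict(x, verbose=0)[0]
    print("Generated: ", generated)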

In [15]:
# Sample 1000 characters from the model using a random seed from the vocabulary.
sample_model(indices_char[random.randint(0,vocabulary_size-1)], model_dec, 1000)

Seed:  t




Generated:  tj� pegand strose 
the corgury.....................................................................................................................................................................................................................................  
sivanizg  
the domparton
the magder
emmomang  
sever  
mane of the lorgh dore   
the surgeruch   
the exignt of the lorder  
the coust of gluyy  
the surderly  
the pegegaly  
the wrourty  
cork of helle  
the sutuve upthe mand dommolity
the pegond trity
vegger
in the curdery.......................  
shapricition
wattle sever 
bligitiver
inf clund twing 
the sight   
come  
the vile  
the cormaght  
the vivader
the pist cist  
the couth 
juttle lord  
cork of ther lordh douth dorgay: part ii   
the suyd of the lorross and stroch  
the auigignty 
manger-y
the magded of the proch dold
the sivigadaser
in the cormand twuty? part ii   
the dissce  
baller couthe coust of blesuun  
sepent'r   
denagale fargor, part ii  
th