**Importing the required libraries.**

In [1]:
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.callbacks import LambdaCallback

**Creating the vocabulary** <br>
We read the data from the text file and collect its unique characters, which form the dictionary for the model. The data contains 19909 characters in total, 27 of them unique: the letters a-z (26 characters) plus the newline character "\n".

In [11]:
with open('dinos.txt', 'r') as f:
    data = f.read()
data = data.lower()
chars = list(set(data))
data_size, vocab_size = len(data), len(chars)
print('There are %d total characters and %d unique characters in your data.' % (data_size, vocab_size))

There are 19909 total characters and 27 unique characters in your data.
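
Because Python sets are unordered, `list(set(data))` may return the characters in a different order from run to run. The following is a minimal sketch, using a short made-up string instead of dinos.txt, of how sorting the set gives a reproducible character list. The names `sample` and `sample_chars` are illustrative and not part of the notebook.

```python
# Hypothetical stand-in for the lowercased contents of dinos.txt
sample = "aardonyx\nabelisaurus\n"

# Sorting the set gives the same character order on every run
sample_chars = sorted(set(sample))

print('There are %d total characters and %d unique characters in the sample.'
      % (len(sample), len(sample_chars)))
print(sample_chars)
```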


Next, we split the data on the "\n" characters to obtain a list of names, which will act as the input to our model. We then append a "." to each name to act as an end-of-sequence (EOS) token.

In [None]:
names = data.split()
names

In [18]:
names = [name + '.' for name in names]
names[:10]

['aachenosaurus.',
 'aardonyx.',
 'abdallahsaurus.',
 'abelisaurus.',
 'abrictosaurus.',
 'abrosaurus.',
 'abydosaurus.',
 'acanthopholis.',
 'achelousaurus.',
 'acheroraptor.']

In the cell below, we create a Python dictionary (i.e., a hash table) that maps each character to an index from 0 to 27: the space character maps to 0, the letters a-z to 1-26, and the EOS token "." to 27. We also create a second Python dictionary that maps each index back to the corresponding character.

In [19]:
# Convert from character to index: chr(97) is 'a', so 'a'-'z' map to 1-26
char_to_index = dict((chr(i+96), i) for i in range(1, 27))
char_to_index[' '] = 0
char_to_index['.'] = 27

# Convert from index to character
index_to_char = dict( (i, chr(i+96)) for i in range(1,27))
index_to_char[0] = ' '
index_to_char[27] = '.'
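
As a quick sanity check, the two dictionaries should be inverses of each other. The sketch below rebuilds the same mappings and round-trips a short string through them; `encode` and `decode` are hypothetical helpers introduced only for this illustration.

```python
# Same mappings as in the cell above
char_to_index = {chr(i + 96): i for i in range(1, 27)}
char_to_index[' '] = 0
char_to_index['.'] = 27
index_to_char = {i: ch for ch, i in char_to_index.items()}

def encode(name):
    """Turn a string into a list of character indices (hypothetical helper)."""
    return [char_to_index[ch] for ch in name]

def decode(indices):
    """Turn a list of character indices back into a string (hypothetical helper)."""
    return ''.join(index_to_char[i] for i in indices)

encoded = encode('abc.')
print(encoded)          # [1, 2, 3, 27]
print(decode(encoded))  # abc.
```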

We create variables to store the maximum length of a name (including the trailing "."), the total number of names in the input data, and the number of characters in our vocabulary.

In [23]:
max_char = len(max(names, key=len))
m = len(names)
char_dim = len(char_to_index)

**TRAINING DATA** <BR>
We create the training data set by initializing two zero arrays, one for the input and the other for the expected output characters. For each of the m names in our dataset, we create a 2-dimensional matrix with one row per character position. Every matrix has the same number of rows (max_char); if a name is too short to fill the whole matrix, the remaining rows stay all zeros. Each filled row represents one character, encoded as a one-hot vector: a vector of zeros with a single one in the entry that corresponds to that character.

The output Y is the same as the input but shifted by one time step: the ith character in Y is the (i+1)th character of the actual name. The network therefore learns to predict the character that follows a given character in a sequence.

In [37]:
X = np.zeros((m, max_char, char_dim))
Y = np.zeros((m, max_char, char_dim))

for i in range(m):
    name = list(names[i])
    for j in range(len(name)):
        X[i, j, char_to_index[name[j]]] = 1
        if j < len(name)-1:
            Y[i, j, char_to_index[name[j+1]]] = 1
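
To make the one-step shift between X and Y concrete, here is a small sketch that encodes a single made-up name with the same loop and then reads the one-hot rows back as indices. The name `'rex.'` and the length `toy_max_char = 6` are assumptions chosen for illustration; padded rows are reported as `None`.

```python
import numpy as np

# Same mapping as above: ' ' -> 0, 'a'-'z' -> 1-26, '.' -> 27
char_to_index = {chr(i + 96): i for i in range(1, 27)}
char_to_index[' '] = 0
char_to_index['.'] = 27

toy_name = 'rex.'      # hypothetical name, already ending in the EOS token
toy_max_char = 6       # hypothetical maximum length, longer than the name
toy_dim = len(char_to_index)

x = np.zeros((toy_max_char, toy_dim))
y = np.zeros((toy_max_char, toy_dim))
for j, ch in enumerate(toy_name):
    x[j, char_to_index[ch]] = 1
    if j < len(toy_name) - 1:
        y[j, char_to_index[toy_name[j + 1]]] = 1

# Read each one-hot row back as an index; all-zero (padded) rows become None
x_indices = [int(row.argmax()) if row.any() else None for row in x]
y_indices = [int(row.argmax()) if row.any() else None for row in y]
print(x_indices)  # [18, 5, 24, 27, None, None]
print(y_indices)  # [5, 24, 27, None, None, None]
```

Each target row holds the character that follows the corresponding input row, and the last input character (the ".") has no target.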

**MODEL WITH AN LSTM LAYER**
<BR>
Here we use a single recurrent layer, an LSTM with 128 units. Because return_sequences=True, the layer returns its output at every time step, and we feed this into a fully connected dense layer that converts each time step's output into a vector of size char_dim using a softmax activation. We use categorical cross-entropy as the loss function, which pairs naturally with the softmax output, and optimize with Adam. There is no particularly useful metric for judging how well the model does, so we will mostly just look at the generated names.

In [38]:
model = Sequential()
model.add(LSTM(128, input_shape=(max_char, char_dim), return_sequences=True))
model.add(Dense(char_dim, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam')
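
The pairing of softmax with categorical cross-entropy can be illustrated on a single time step with plain NumPy. This is a sketch of the textbook formulas under made-up scores, not of Keras's internal implementation; the helper names `softmax` and `categorical_crossentropy` below are defined here for illustration only. It also shows that, under this formula, an all-zero target row (like the padded time steps in Y) contributes no loss.

```python
import numpy as np

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # sum(-y_true * log(y_pred)); clipping guards against log(0)
    return float(np.sum(-y_true * np.log(np.clip(y_pred, eps, 1.0))))

logits = np.array([2.0, 1.0, 0.1, -1.0])  # made-up scores for a 4-character vocabulary
probs = softmax(logits)

target = np.array([1.0, 0.0, 0.0, 0.0])   # one-hot: the true next character is index 0
padding = np.zeros(4)                     # all-zero row, like a padded time step in Y

loss_target = categorical_crossentropy(target, probs)
loss_padding = categorical_crossentropy(padding, probs)
print(probs.sum(), loss_target, loss_padding)
```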

We create a function to generate new names. We start by feeding the model an all-zero input and use its prediction at the first time step to sample the first character. Each sampled character is then written into the input so that it conditions the prediction at the next time step. At every step we renormalize the predicted probabilities and randomly sample a character according to them. The "." character signals the end of the name, and we force it once the name reaches the maximum length.

In [39]:
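def make_name(model):
    name = []
    x = np.zeros((1, max_char, char_dim))
    end = False
    i = 0

    while not end:
        # Predicted distribution over characters at time step i
        probs = np.asarray(model.predict(x)[0, i], dtype=np.float64)
        # Renormalize so the probabilities sum to 1, as np.random.choice requires
        probs = probs / np.sum(probs)
        index = np.random.choice(range(char_dim), p=probs)
        if i == max_char-2:
            # Force the EOS token once the name reaches the maximum length
            character = '.'
            end = True
        else:
            character = index_to_char[index]
        name.append(character)
        # Write the sampled character into the input for the next time step
        x[0, i+1, index] = 1
        i += 1
        if character == '.':
            end = True

    print(''.join(name))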
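
make_name samples the next character from the predicted distribution rather than always taking the most likely one, which is why repeated calls produce different names. The sketch below isolates that step with made-up probabilities in place of a model prediction. It uses a seeded `np.random.default_rng` generator for reproducibility, which is an assumption of this example; the notebook itself calls `np.random.choice`.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator, used here only for reproducibility

# Made-up float32 scores standing in for one time step of model.predict
raw = np.array([0.1, 0.2, 0.6, 0.1], dtype=np.float32)

# Cast to float64 and renormalize so the probabilities sum to 1
probs = raw.astype(np.float64)
probs = probs / probs.sum()

# Draw many samples and compare the empirical frequencies with probs
samples = rng.choice(len(probs), size=1000, p=probs)
counts = np.bincount(samples, minlength=len(probs))
print(counts / counts.sum())
```

The most probable index is drawn most often, but the less probable ones still appear, which keeps the generated names varied.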

In [40]:
def generate_name_loop(epoch, _):
    # Print three sample names every 25 epochs
    if epoch % 25 == 0:
        
        print('Names generated after epoch %d:' % epoch)

        for i in range(3):
            make_name(model)
        
        print()

Now we want to use make_name during training to monitor how the generated names improve. To this end, we wrap the generate_name_loop function defined above in a LambdaCallback that is passed to the model when we fit it. It runs make_name three times every 25 epochs and prints the results.

In [41]:
name_generator = LambdaCallback(on_epoch_end=generate_name_loop)

In [42]:
model.fit(X, Y, batch_size=64, epochs=300, callbacks=[name_generator], verbose=0)

Names generated after epoch 0:
aeovda .
mwoccblsu.
bhpfm.

Names generated after epoch 25:
huriposaurus.
wortisausus.
burodusmucanatora.

Names generated after epoch 50:
ujiangonrathos.
umarkasaurus.
.

Names generated after epoch 75:
odonyx.
kitsteptyryxaxa.
ichixavosaurus.

Names generated after epoch 100:
pstoceracovdinakraimoseur.
habrosaurus.
chenosaurus.

Names generated after epoch 125:
ugraceltir.
antarosaurus.
zongyrannauran.

Names generated after epoch 150:
centsoepter.
urakesaurus.
rindan.

Names generated after epoch 175:
eorartia.
jacerltops.
ulansaurus.

Names generated after epoch 200:
yzhoullisaurus.
icetaosaurus.
uncanosaurus.

Names generated after epoch 225:
halg.
ingosaurus.
asthophaurus.

Names generated after epoch 250:
hangzosaurus.
angwatia.
hipposaurus.

Names generated after epoch 275:
ethorosaurus.
elicrisaurus.
ardosaurus.



<tensorflow.python.keras.callbacks.History at 0x7fb050cf4a58>

**FINAL OUTPUT OF NAMES**

In [43]:
for i in range(20):
    make_name(model)

oloroptor.
rizalong.
haypalichus.
oththolia.
untaratons.
utahania.
huylanvisaurus.
ethaisanous.
uncangodia.
narstous.
utahosaurus.
opiseraptor.
ixianosaurus.
eneherpstes.
quillithon.
eninoraptor.
therodontos.
utahelosaurus.
qijballodon.
ormanoceratops.
