# Training a character LSTM Neural Network

In this Python notebook, I preprocess the data and train a character-based neural network. Spoiler alert: it does not work well for this task.

The goal of this model is to take a name as input and generate a predicted Trump nickname.

# Let's start with preprocessing!

In [2]:
import pandas as pd
import numpy as np
import spacy

raw = pd.read_csv('cleaned.nicknames.csv')

model = spacy.load('en_core_web_sm')

# add a period so training knows when a name is done
nicknames = [f'{name}.' for name in raw['fake name']]
realnames = [f'{name}' for name in raw['real name']]

raw.head()

| Unnamed: 0 | fake name | real name | len fake | len real | category | notes | count |
|---|---|---|---|---|---|---|---|
| 0 | dumbo | randolph tex alles | 1 | 3 | domestic political figures | director of the united states secret service | 1 |
| 1 | wheres hunter | hunter biden | 2 | 2 | domestic political figures | american lawyer and lobbyist who is the second... | 1 |
| 2 | 1% joe | joe biden | 2 | 2 | domestic political figures | 47th vice president of the united states; form... | 1 |
| 3 | basement joe | joe biden | 3 | 2 | domestic political figures | 47th vice president of the united states; form... | 1 |
| 4 | beijing joe | joe biden | 3 | 2 | domestic political figures | 47th vice president of the united states; form... | 1 |


Here we simply loaded the data and created two lists, 'nicknames' and 'realnames', which will be used for vectorizing.

First, however, let's grab the parts of speech for the nicknames.

# Grabbing POS from nicknames

In [112]:
# pos tagging
def nlp(name):
    name = model(name)
    return {word: word.pos_ for word in name}

test = [nlp(name) for name in raw['fake name']]

test[0:3]

# for name in test:
#     print(name)
#     for token in name:
#         print(token, token.pos_)

[{dumbo: 'PROPN'},
 {where: 'ADV', s: 'PROPN', hunter: 'VERB'},
 {1: 'NUM', %: 'NOUN', joe: 'PROPN'}]

These are the first three nicknames with their parts of speech. One thing I notice is that not all prefixes are adjectives.

We will not actually use these in training; I was simply curious about the tags.
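The observation about prefixes can be checked with a quick tally of the tag on each nickname's first token. This is a minimal sketch that uses only the three tagged nicknames printed above, copied by hand as (token, tag) pairs. `first_tag_counts` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter

# The three tagged nicknames printed above, copied as (token, tag) pairs.
tagged = [
    [("dumbo", "PROPN")],
    [("where", "ADV"), ("s", "PROPN"), ("hunter", "VERB")],
    [("1", "NUM"), ("%", "NOUN"), ("joe", "PROPN")],
]

def first_tag_counts(tagged_names):
    """Count the part-of-speech tag of the first token in each name."""
    return Counter(name[0][1] for name in tagged_names if name)

counts = first_tag_counts(tagged)
print(counts)
```

On the full data set, the same tally could be run over the `test` list, reading the first tag from each dictionary's values instead.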

# Let's begin vectorizing

In [113]:
print(nicknames[0:3], realnames[0:3])

['dumbo.', 'wheres hunter.', '1% joe.'] ['randolph tex alles', 'hunter biden', 'joe biden']


The first list holds the nicknames. As you can see, we added a period to the end of each one. This gives us a marker character for the end of the name, which will be useful for prediction and generation.
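To illustrate why an end marker helps, here is a minimal sketch of how it can be used in both directions: appended before training, and used to cut a generated string once the marker shows up. The helper names `add_end_marker` and `cut_at_end_marker` are assumptions made for this example and do not appear in the notebook.

```python
END = "."  # end-of-name marker, the same character the notebook appends

def add_end_marker(name):
    """Append the end marker so the end of the name is an explicit character."""
    return name + END

def cut_at_end_marker(generated):
    """Keep only the text before the first end marker, if one was produced."""
    return generated.split(END, 1)[0]

marked = add_end_marker("dumbo")
print(marked)                                   # dumbo.
print(cut_at_end_marker("basement joe.xxqz"))   # basement joe
```

If the generated text never contains the marker, `cut_at_end_marker` returns it unchanged, so a separate length limit is still needed.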

Now let's vectorize!

In [4]:
# vectorize names
##########################
allnames = nicknames + realnames

# max char length in names
max_chars = max([len(name) for name in allnames])
# max_real_chars = max([len(name) for name in realnames])
# total number of names
n = len(nicknames)
# nicknames to realnames
nick2real = {nicknames[i]:realnames[i] for i in range(n)}
# character to index
char2i = {char:0 for name in allnames for char in name}
char2i = {char:n for n, char in enumerate(char2i)}
# index to character
i2char = {char2i[char]: char for char in char2i}
char_dimensions = len(i2char)

# `output` holds the one-hot nicknames; despite its name, it is the model's input
output = np.zeros((n, max_chars, char_dimensions))
# `label` holds the one-hot real names, which are the training targets
label = np.zeros((n, max_chars, char_dimensions))

# vectorize output and labels
for i, name in enumerate(nicknames):
    name = list(name)
    for row, ch in enumerate(name):
        # assign 1 to nickname, character number, vocab index
        output[i, row, char2i[ch]] = 1
        
    for row, ch in enumerate(nick2real[''.join(name)]):
        # assign 1 to real name, character number, vocab index
        label[i, row, char2i[ch]] = 1

print(output[0])

[[1. 0. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 [0. 0. 1. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


This is the first vectorized nickname ('dumbo.'). The 1s fall on a diagonal because vocabulary indices were assigned in order of first appearance rather than from a predefined vocabulary: 'dumbo.' is the first name processed and has no repeated characters, so its characters receive indices 0, 1, 2, and so on. The remaining rows are all zeros because the name is shorter than max_chars.
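The diagonal can be reproduced on a small scale. The sketch below uses plain Python lists instead of NumPy arrays to stay dependency-free, and only the first three nicknames. `build_vocab` and `one_hot` are hypothetical helper names, and the `max_chars=8` value is chosen only for the demonstration.

```python
def build_vocab(names):
    """Map each character to an index in order of first appearance."""
    return {ch: i for i, ch in enumerate(dict.fromkeys("".join(names)))}

def one_hot(name, char2i, max_chars):
    """Return a max_chars x vocab_size grid of 0/1 values for one name."""
    grid = [[0] * len(char2i) for _ in range(max_chars)]
    for row, ch in enumerate(name):
        grid[row][char2i[ch]] = 1
    return grid

names = ["dumbo.", "wheres hunter.", "1% joe."]
char2i = build_vocab(names)
grid = one_hot(names[0], char2i, max_chars=8)

# The first six rows and columns form an identity pattern for 'dumbo.'.
for row in grid[:6]:
    print(row[:6])
```

Later names do not show the same pattern. In 'wheres hunter.' the second 'e' reuses the index already given to the first, so the 1s no longer line up on a diagonal.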

Either way, we know this worked, so let's move on to training!

# Training the neural network

In [8]:
from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.callbacks import LambdaCallback

model = Sequential()
model.add(LSTM(128, input_shape=(max_chars, char_dimensions), return_sequences=True))
model.add(Dense(char_dimensions, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam')
print(model)

<tensorflow.python.keras.engine.sequential.Sequential object at 0x000001F01C971B50>


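The model object was created: a 128-unit LSTM layer that returns the full sequence, followed by a softmax Dense layer over the character vocabulary.

Because we do not care about accuracy in the traditional sense for this task, I will build a simple function that generates a name by sampling from the model's probabilities. It will be used to print sample names at designated epochs during training.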
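The sampling step can be sketched in isolation. The notebook uses `np.random.choice` with the `p` argument; this example swaps in the standard library's `random.choices` so that it runs without NumPy. The three-letter vocabulary and the weights are toy stand-ins for one row of softmax output, and `sample_index` and `greedy_index` are hypothetical names.

```python
import random

def sample_index(weights, rng=random):
    """Draw one index with probability proportional to its weight."""
    total = sum(weights)
    probabilities = [w / total for w in weights]  # renormalize to sum to 1
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]

def greedy_index(weights):
    """Always pick the single most likely index."""
    return max(range(len(weights)), key=weights.__getitem__)

vocab = ["a", "b", "c"]      # toy vocabulary, not the notebook's
weights = [0.7, 0.2, 0.1]    # toy stand-in for one softmax row
rng = random.Random(0)       # seeded so the run is repeatable

draws = [vocab[sample_index(weights, rng)] for _ in range(1000)]
print({ch: draws.count(ch) for ch in vocab})
print(vocab[greedy_index(weights)])
```

Sampling gives different names on each call, while the greedy choice would return the same name every time for the same input.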

In [18]:
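def generate_name(model, limit, input):
    # Start from an all-zero sequence. Note that `input` is only used to label
    # the printed result; the name itself is never encoded into word_vec.
    word_vec = np.zeros((1, max_chars, char_dimensions))

    def predict(index):
        # the last position is always the hardcoded end-of-name period
        if index == limit-1:
            return '.'
        # pull the model's probabilities for this character position
        probabilities = list(model.predict(word_vec)[0,index])
        # renormalize so the probabilities sum to 1 for np.random.choice
        probabilities = probabilities / np.sum(probabilities)
        # sample a character index from the distribution
        guess = np.random.choice(range(char_dimensions), p=probabilities)
        # feed the sampled character back in at the next position
        word_vec[0, index+1, guess] = 1
        return i2char[guess]

    gen_name = ''.join([predict(i) for i in range(limit)])
    print(f'{input}: {gen_name}')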

# Training

Now let's run our model.

In [19]:
def generate_name_loop(epoch, _):
    # Keras passes a 0-based epoch index; sample names on every 999th epoch
    if epoch % 999 == 0:

        print('Names generated after epoch %d:' % epoch)

        for name in ['donald trump', 'joe biden', 'hillary clinton']:
            generate_name(model, limit = 13, input = name)
        
        print()
      
name_generator = LambdaCallback(on_epoch_end = generate_name_loop)

model.fit(output, label, batch_size=64, epochs=10000, callbacks=[name_generator], verbose=0)

model.save("model.output")

Names generated after epoch 0:
donald trump: jannss  fadd.
joe biden: jannss fadnd.
hillary clinton: cae ck oende.

Names generated after epoch 999:
donald trump: jaun   famer.
joe biden: jannss fadnw.
hillary clinton: blexddssknid.

Names generated after epoch 1998:
donald trump: jarnsr caher.
joe biden: jannss hannw.
hillary clinton: jaeeeb ffdee.

Names generated after epoch 2997:
donald trump: jannss cannw.
joe biden: jannss farnw.
hillary clinton: haalls  hhws.

Names generated after epoch 3996:
donald trump: jannssssaaaw.
joe biden: cae ci conyy.
hillary clinton: mooogeenabat.

Names generated after epoch 4995:
donald trump: aaenss hanni.
joe biden: mamnilccomwo.
hillary clinton: jaeeni  ende.

Names generated after epoch 5994:
donald trump: paless hiiio.
joe biden: geett  hende.
hillary clinton: jannss hannw.

Names generated after epoch 6993:
donald trump: jannss hanww.
joe biden: aalensccciii.
hillary clinton: jannss fannw.

Names generated after epoch 7992:
donald trump: nnnn

Things to note about this training:

+ no name was shorter than the character limit
    - the model never predicted the end of a name on its own; the final period in each sample is hardcoded by `generate_name`. One likely contributor, visible in the vectorizing code, is that the period is appended only to the nicknames fed in as input and not to the real names used as labels, so the model is rarely, if ever, shown a period to predict. This might be fixable using probabilities based on our data set. Either way, it is a major limit on the usability of the generated names.

+ we see names with more than one consecutive space
    - this could easily be corrected by hardcoding a limit of one consecutive space in the name generation; however, it is not worth the effort for this model.

+ the generated names tend to become more word-like, which suggests the model picks up on how words are put together.
    - However, the biggest problem we see is repeated characters. This could also be hardcoded away, but doing so would break names that legitimately contain consecutive letters, such as 'hillary'.
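The two hardcoded fixes mentioned in the list above could be applied as a post-processing step. This is a sketch under stated assumptions, not something the notebook does: `tidy_name` and its `max_run` parameter are hypothetical. Capping runs at two characters, rather than one, is what lets a name like 'hillary' pass through unchanged.

```python
import re

def tidy_name(name, max_run=2):
    """Collapse repeated spaces and cap any run of one character at max_run."""
    name = re.sub(r" {2,}", " ", name)       # at most one space in a row
    pattern = r"(.)\1{%d,}" % max_run         # a run longer than max_run
    return re.sub(pattern, lambda m: m.group(1) * max_run, name)

# Two samples taken from the training output above, plus a real name.
print(tidy_name("jaun   famer."))   # jaun famer.
print(tidy_name("jannssssaaaw."))   # jannssaaw.
print(tidy_name("hillary"))         # hillary
```

This only hides the symptom. The model still produces the repeats, and a cap of two leaves doubled letters such as 'nn' in place.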
    
As the per-epoch samples show, the model does not work well at all. However, I am going to save the model, generate some more names, and evaluate how well it did.

# Model Evaluation


In [20]:
from tensorflow import keras
model = keras.models.load_model("model.output")

In [29]:
names = 'alex kahanek,donald trump,joe biden,barack obama,george bush,hillary clinton,karen,bob,the one guy,dumbo,1% joe,this name is really long'.split(',')
name_limit = [5, 10, 15]

for limit in name_limit:
    print(f'limit: {limit} --------')
    for name in names:
        generate_name(model, limit = limit, input = name)
    print('-----------------------')
    print('\n')

limit: 5 --------
alex kahanek: jann.
donald trump: jaee.
joe biden: mamn.
barack obama: pale.
george bush: jann.
hillary clinton: jann.
karen: ble .
bob: jann.
the one guy: aaen.
dumbo: jann.
1% joe: jann.
this name is really long: jaen.
-----------------------


limit: 10 --------
alex kahanek: palesscci.
donald trump: hhrnllls .
joe biden: ble  iinn.
barack obama: geett  te.
george bush: jannss ca.
hillary clinton: laannsscc.
karen: geee cidt.
bob: jarns  ca.
the one guy: jannss  a.
dumbo: kiille  u.
1% joe: geett  te.
this name is really long: aaenss ha.
-----------------------


limit: 15 --------
alex kahanek: jale   feddees.
donald trump: aolleeooaaaess.
joe biden: blee iinnnnyyr.
barack obama: blexd tfffiii .
george bush: jannss canwwen.
hillary clinton: bjoniiitenrrre.
karen: jannss haeedde.
bob: jamn   hunrrrn.
the one guy: jannsss wdndde.
dumbo: nnnnnnccllllll.
1% joe: geee cidenidei.
this name is really long: aaenssccciiinn.
-----------------------




From this we can see that the character-based name generation did not work well. It seems there is not enough training data to generate coherent names, so I think a word-based model might perform better for this task.
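As a rough idea of what the word-based alternative would start from, the vocabulary could be built the same way as `char2i`, but over whole words. This is only a sketch: `build_word_vocab` and `encode` are hypothetical helpers, the list holds just the five nicknames shown in the table at the top, and whether such a model actually performs better is untested here.

```python
def build_word_vocab(names):
    """Map each distinct word to an index in order of first appearance."""
    words = [word for name in names for word in name.split()]
    return {word: i for i, word in enumerate(dict.fromkeys(words))}

def encode(name, word2i):
    """Turn a name into a list of word indices."""
    return [word2i[word] for word in name.split()]

# The five nicknames shown in the table at the top of the notebook.
nicknames = ["dumbo", "wheres hunter", "1% joe", "basement joe", "beijing joe"]
word2i = build_word_vocab(nicknames)
print(word2i)
print(encode("basement joe", word2i))
```

Here 'joe' gets a single index shared by three nicknames, so a word-level model would see it as one reusable unit instead of three separate character sequences.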

Notes about why this is bad:

+ as the limit increases, we see more repeated characters
+ the names never stop themselves
+ the algorithm really likes the letter j
+ none of the names make any sense
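The first point in the list above can be put in numbers. The sketch below computes the share of adjacent character pairs that are identical, using the generated names printed above for each limit. `repeat_rate` is a hypothetical helper and this metric is an assumption of the example, not something the notebook measures.

```python
def repeat_rate(names):
    """Fraction of adjacent character pairs that are the same character."""
    repeats = pairs = 0
    for name in names:
        body = name.rstrip(".")          # drop the hardcoded final period
        for left, right in zip(body, body[1:]):
            pairs += 1
            repeats += left == right
    return repeats / pairs

# Generated names copied from the evaluation output above, keyed by limit.
generated = {
    5: ["jann.", "jaee.", "mamn.", "pale.", "jann.", "jann.",
        "ble .", "jann.", "aaen.", "jann.", "jann.", "jaen."],
    10: ["palesscci.", "hhrnllls .", "ble  iinn.", "geett  te.",
         "jannss ca.", "laannsscc.", "geee cidt.", "jarns  ca.",
         "jannss  a.", "kiille  u.", "geett  te.", "aaenss ha."],
    15: ["jale   feddees.", "aolleeooaaaess.", "blee iinnnnyyr.",
         "blexd tfffiii .", "jannss canwwen.", "bjoniiitenrrre.",
         "jannss haeedde.", "jamn   hunrrrn.", "jannsss wdndde.",
         "nnnnnnccllllll.", "geee cidenidei.", "aaenssccciiinn."],
}

rates = {limit: repeat_rate(names) for limit, names in generated.items()}
for limit, rate in rates.items():
    print(f"limit {limit}: {rate:.2f}")
```

On these samples the rate is higher at each larger limit. Repeated spaces count as repeats too, so the number covers both problems noted during training.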

Notes about what is good:

+ notice that if you took out the repeated letters, the names would resemble words more.
+ it does seem to understand the difference between vowels and consonants and how each is used,
    - if we ignore the repeated letters.

