# Training a Neural Network to Generate Trump Nicknames

This notebook takes a word-based approach to nickname generation. We will walk through the preprocessing required to tokenize the nicknames, using the data set pulled from [this link here]. I cleaned and analyzed the data myself; click [this link here] to see that analysis.

# Grabbing the data

In [27]:
import pandas as pd
import numpy as np

raw = pd.read_csv('cleaned.nicknames.csv')

raw.head()

| Unnamed: 0 | fake name | real name | len fake | len real | category | notes | count |
|---|---|---|---|---|---|---|---|
| 0 | dumbo | randolph tex alles | 1 | 3 | domestic political figures | director of the united states secret service | 1 |
| 1 | wheres hunter | hunter biden | 2 | 2 | domestic political figures | american lawyer and lobbyist who is the second... | 1 |
| 2 | 1% joe | joe biden | 2 | 2 | domestic political figures | 47th vice president of the united states; form... | 1 |
| 3 | basement joe | joe biden | 3 | 2 | domestic political figures | 47th vice president of the united states; form... | 1 |
| 4 | beijing joe | joe biden | 3 | 2 | domestic political figures | 47th vice president of the united states; form... | 1 |


Here is the data we are working with. We have the nicknames (fake name) and the corresponding real name of the individual Trump gave the nickname to. We also have a few other columns, but those will not matter for this task.

# Tokenizing the names

Here we need to separate each word and give it a tag. We will use a few custom tags:

+ real names:
    - these will be given a tag for each word in the name, for example,
        + 'joe biden' will become 'joe <name1> biden <name2>'

+ nicknames:
    - these follow the rule that if a word from the real name appears in the nickname, it is replaced with the corresponding real name tag.
        + i.e., `basement joe` will become `basement <name1>`
    - a `<prefix>` tag is added to each word before the first substitution, and a `<suffix>` tag to the substitution and every word after it.
        + e.g. `basement <name1>` becomes `basement <prefix> <name1> <suffix>`
    - if there are no substitutions, then a `<nope>` tag is added to each word.

We make these modifications so that the generated `<nameX>` tags can be filled in from the user's input. For example, if a user inputs `Joe Biden` and the generated nickname follows the pattern `<prefix> <name1>`, the generation algorithm can substitute `joe` for `<name1>`, so the only word the model has to generate is the one tagged `<prefix>`. The model still has to predict which name tag, if any, appears in the suffix part of the nickname. This should also help with training, since we only need to pick the length of the nickname and then predict the affix tags that follow. For example, if we generate the tags `<prefix> <prefix> <suffix>`, we can then generate the best word for each position, where name tags can only come from a `<suffix>` position.
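
To make the substitution idea concrete, here is a minimal sketch of filling a generated pattern with the words of a user's input. The helper `fill_template` is hypothetical (it is not part of this notebook), and it assumes the pattern has already had its affix tags stripped, so only words and `<nameX>` tags remain.

```python
def fill_template(template, real_name):
    """Replace <nameX> tags in a space-separated template with words from real_name."""
    # map '<name1>' -> first word, '<name2>' -> second word, ...
    tag2word = {f'<name{i + 1}>': word
                for i, word in enumerate(real_name.lower().split())}
    # tags with no matching word (e.g. <name3> for a two-word name) are left untouched
    return ' '.join(tag2word.get(token, token) for token in template.split())


print(fill_template('basement <name1>', 'Joe Biden'))  # basement joe
print(fill_template('sleepy <name2>', 'Joe Biden'))    # sleepy biden
```

Leaving unmatched tags untouched is only one possible choice; the generation function later in this notebook falls back to `<name1>` instead.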

In [261]:
def tokenize(realname, nickname):
    '''tokenizes each real name and nickname to follow the rules defined above'''

    # get a dictionary for the real name and corresponding token, i.e. input
    real2token = {word: f'<name{X+1}>' for X, word in enumerate(realname.split(' '))}
    # convert dictionary to word tokenized groups and join into single string
    real_tokenized = ' '.join([f'{word} {real2token[word]}' for word in real2token])

    # change nickname into single string with tokenization and substitution
    # grab names to substitute (whole-word matches only, so a name inside a longer word is ignored)
    subs = [sub for sub in realname.split(' ') if sub in nickname.split(' ')]

    if len(subs) == 0:
        # then there are no substitutions, tokens are <nope>
        nick_tokenized = ' '.join([f'{word} <nope>' for word in nickname.split(' ')])
        return (real_tokenized, nick_tokenized)

    substituted = ' '.join([word if word not in subs else f'{real2token[word]}' for word in nickname.split(' ')])

    token = '<prefix>'
    tokenized = []

    for word in substituted.split(' '):
        if '<' in word and 'prefix' in token:
            token = '<suffix>'

        tokenized.append(f'{word} {token}')
    
    nick_tokenized = ' '.join(tokenized)

    return (real_tokenized, nick_tokenized)


tokenized_names = [tokenize(realname, nickname) for i, nickname, realname in raw[['fake name', 'real name']].itertuples()]

print('real name | nickname')
tokenized_names[0:3]

real name | nickname


[('randolph <name1> tex <name2> alles <name3>', 'dumbo <nope>'),
 ('hunter <name1> biden <name2>', 'wheres <prefix> <name1> <suffix>'),
 ('joe <name1> biden <name2>', '1% <prefix> <name1> <suffix>')]

Okay, now that everything is tokenized, we can start to vectorize it!

# Vectorizing the tokenized names

To vectorize this we need to define our vocabulary, then create a matrix for each name. Each matrix has one row per token position (up to the maximum token count) and one column per vocabulary entry. To vectorize a name, we put a 1 in each token's row, in the column given by that token's vocabulary index.
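
As a small illustration of that layout, here is a sketch that one-hot encodes a single tokenized nickname. The vocabulary `toy_token2i`, the row count `toy_max_tokens`, and the helper `one_hot` are all made up for this example and are much smaller than what the real data produces.

```python
import numpy as np

# toy vocabulary (hypothetical, far smaller than the real one)
toy_token2i = {'<prefix>': 0, '<suffix>': 1, '<nope>': 2, 'basement': 3, '<name1>': 4}
toy_max_tokens = 6  # rows: longest tokenized name in the toy data


def one_hot(tokenized, token2i, max_tokens):
    """Return a (max_tokens, vocab size) matrix with a single 1 in each token's row."""
    matrix = np.zeros((max_tokens, len(token2i)))
    for row, token in enumerate(tokenized.split(' ')):
        matrix[row, token2i[token]] = 1
    return matrix


vec = one_hot('basement <prefix> <name1> <suffix>', toy_token2i, toy_max_tokens)
print(vec.shape)        # (6, 5)
print(vec.sum(axis=1))  # [1. 1. 1. 1. 0. 0.]
```

The last two rows stay all zero because the nickname has only four tokens; those rows act as padding.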

In [294]:
##### get vocab ######
# flatten paired list to get all names
flatten = [name for pair in tokenized_names for name in pair]
# flatten = [nick for (real, nick) in tokenized_names]
# make sure the created tokens go first in the vocab dictionaries
flatten[:0] = ['<prefix>', '<suffix>', '<nope>'] #, '<name1>', '<name2>', '<name3>', '<name4>', '<name5>', '<name6>']

###### get dictionaries #########
# get dictionary of nicknames to real names
nick2real = {nickname:realname for (realname, nickname) in tokenized_names}
# get dictionaries for vocab
uni_tokens = {token:0 for name in flatten for token in name.split(' ')}
# dictionary for token to index
token2i = {token:i for i, token in enumerate(uni_tokens)}
# dictionary for index to token
i2token = {token2i[token]: token for token in token2i}
# dictionary for word to affix

######### math ##########
# find total number of nicknames
n = len(tokenized_names)
# find max number of tokens, ie, rows in matrix
max_tokens = max([len(name.split(' ')) for name in flatten])
# find total number of tokens, ie columns in matrix
token_dimensions = len(i2token)

##### get matrices ########
# set up vectors for output = nicknames
output = np.zeros((n, max_tokens, token_dimensions))
# set up vectors for label (originally meant for real names, currently also filled with nickname tokens)
label = np.zeros((n, max_tokens, token_dimensions))

#### vectorize names #######
# note: nick2real is keyed by nickname, so entries that share a tokenized nickname collapse into one row
for i, nickname in enumerate(nick2real):
    # nickname assignment: output and label get the same one-hot rows
    for row, token in enumerate(nickname.split(' ')):
        output[i, row, token2i[token]] = 1
        label[i, row, token2i[token]] = 1
    
    # real name assignment (disabled)
    # for row, token in enumerate(nick2real[nickname].split(' ')):
    #     label[i, row, token2i[token]] = 1

print(output[1])
    



[[0. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


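Awesome, here we can see a vectorized nickname! It doesn't look like much because the dimensions are very large, but each row that holds a token has a single 1 in the column for that token's vocabulary index. In the printout, the 1 in the second row sits in column 0, which is the `<prefix>` tag.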
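
Since the raw matrix is hard to read, one way to sanity-check it is to decode it back into tokens by taking the position of the 1 in each row. The sketch below uses a made-up `toy_i2token` mapping and a hypothetical `decode` helper; neither exists in this notebook.

```python
import numpy as np

# hypothetical toy vocabulary; in the notebook this role is played by i2token
toy_i2token = {0: '<prefix>', 1: '<suffix>', 2: '<nope>', 3: 'wheres', 4: '<name1>'}


def decode(matrix, i2token):
    """Turn a one-hot matrix back into its token string, skipping empty padding rows."""
    tokens = [i2token[int(np.argmax(row))] for row in matrix if row.any()]
    return ' '.join(tokens)


# build a toy one-hot matrix for: wheres <prefix> <name1> <suffix>
toy = np.zeros((6, 5))
for row, col in enumerate([3, 0, 4, 1]):
    toy[row, col] = 1

print(decode(toy, toy_i2token))  # wheres <prefix> <name1> <suffix>
```

Under these assumptions, calling `decode(output[1], i2token)` in the notebook should recover the tokenized nickname stored in that slot.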

From here we need to build the model and train it!

# Building the model

In [281]:
from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.callbacks import LambdaCallback

model = Sequential()
model.add(LSTM(128, input_shape=(max_tokens, token_dimensions), return_sequences=True))
model.add(Dense(token_dimensions, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam')
print(model)

<tensorflow.python.keras.engine.sequential.Sequential object at 0x000002324CC99CD0>


Now that the model is built, we need to build a couple of functions. Since this isn't a traditional training setup, we evaluate the model by actually generating nicknames. To do this, I will build a generate function and generate a few names at every 10th epoch.

# Functions and training

In [295]:
def generate_name(model, gen_len, input):
    '''generates a nickname based on length of word and input given'''

    word_vec = np.zeros((1, max_tokens, token_dimensions))

    # get a dictionary for the real name and corresponding token, i.e. input
    token2input = {f'<name{X+1}>': word for X, word in enumerate(input.split(' '))}
    # convert dictionary to word tokenized groups and join into single string
    input_tokenized = ' '.join([f'{token2input[token]} {token}' for token in token2input])

    def predict_affix(index):
        '''predicts the affix tag for one of the odd (tag) rows, i.e. index = (i+1)*2 - 1'''
        # pull probabilities for this row, keeping only the three affix tags (vocab indices 0-2)
        p_affix = list(model.predict(word_vec)[0,index])[0:3]
        # normalize probabilities so the three kept values sum to 1
        p_affix_norm = p_affix / np.sum(p_affix)

        # sample an affix tag
        guess = np.random.choice(range(len(p_affix_norm)), p=p_affix_norm) # choose an affix
        word_vec[0, index, guess] = 1
        # print(f'p={p_affix_norm}, g={i2token[guess]}, i={index}')
        return i2token[guess]

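    def predict_token(index, affix):
        '''predicts the word for one of the even (word) rows, i.e. index = i*2'''
        # keep only the non-affix part of the vocabulary (everything after the three affix tags)
        p_token = list(model.predict(word_vec)[0,index])[3:]
        p_token_norm = p_token / np.sum(p_token)

        guess = np.random.choice(range(3, len(p_token_norm)+3), p=p_token_norm)
        word = i2token[guess]
        # print(f'g={guess}, i={index}')
        if word.startswith('<name'):
            # fall back to <name1> when the input has no word for the predicted tag
            if word not in token2input:
                word = '<name1>'
            guess = token2i[word]
            word = token2input[word]

        word_vec[0, index, guess] = 1
        return word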


    affix = [predict_affix((i+1)*2 -1) for i in range(gen_len)]
    print(f'{input}: {affix}')
    tokens = [predict_token(i*2, affix) for i, affix in enumerate(affix)]
    print(f'{input}: {" ".join(tokens)}')
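
Both helper functions inside `generate_name` rely on the same trick: take the model's probabilities for one row, keep only a slice of the vocabulary, renormalize that slice, and sample from it. Here is a standalone sketch of that step. The function `sample_restricted` and the probabilities in `fake_softmax` are made up for illustration, and the sketch uses NumPy's `default_rng` rather than the `np.random.choice` call used above.

```python
import numpy as np


def sample_restricted(probs, start, stop, rng):
    """Sample an index in [start, stop) in proportion to probs, ignoring everything else."""
    window = np.asarray(probs[start:stop], dtype=float)
    window = window / window.sum()  # renormalize so the kept slice sums to 1
    return start + int(rng.choice(len(window), p=window))


rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
fake_softmax = np.array([0.05, 0.10, 0.05, 0.50, 0.30])  # made-up probabilities

affix_index = sample_restricted(fake_softmax, 0, 3, rng)  # always 0, 1 or 2
word_index = sample_restricted(fake_softmax, 3, 5, rng)   # always 3 or 4
print(affix_index, word_index)
```

Adding `start` back to the sampled position is what keeps the result aligned with the full vocabulary index, which is easy to get wrong when the slice does not begin at 0.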

In [296]:
def generate_name_loop(epoch, _):
    if epoch % 10 == 0:
        
        print('Names generated after epoch %d:' % epoch)

        for i, name in enumerate(['donald trump']):
            generate_name(model, gen_len = 3, input = name)
        
        print('-------------')
      
name_generator = LambdaCallback(on_epoch_end = generate_name_loop)

model.fit(output, label, batch_size=64, epochs=500, callbacks=[name_generator], verbose=0)

model.save("word.model.output")

Names generated after epoch 0:
donald trump: ['<nope>', '<nope>', '<nope>']
donald trump: wacky for tax
-------------
Names generated after epoch 10:
donald trump: ['<nope>', '<nope>', '<nope>']
donald trump: morning psycho flailer
-------------
Names generated after epoch 20:
donald trump: ['<prefix>', '<suffix>', '<suffix>']
donald trump: trump trump wannabe
-------------
Names generated after epoch 30:
donald trump: ['<prefix>', '<suffix>', '<suffix>']
donald trump: schitt donald half
-------------
Names generated after epoch 40:
donald trump: ['<prefix>', '<suffix>', '<suffix>']
donald trump: record trump trump
-------------
Names generated after epoch 50:
donald trump: ['<prefix>', '<suffix>', '<suffix>']
donald trump: little donald donald
-------------
Names generated after epoch 60:
donald trump: ['<prefix>', '<suffix>', '<suffix>']
donald trump: for tax canada
-------------
Names generated after epoch 70:
donald trump: ['<nope>', '<nope>', '<nope>']
donald trump: wannabe h flun

In [297]:
from tensorflow import keras
model = keras.models.load_model("word.model.output")

names = 'alex kahanek,donald trump,joe biden,barack obama,this name is really long'.split(',')
gen_len = [2, 3, 4]

for limit in gen_len:
    print(f'limit: {limit} --------')
    for name in names:
        generate_name(model, gen_len = limit, input = name)
    print('-----------------------')
    print('\n')

limit: 2 --------
alex kahanek: ['<prefix>', '<suffix>']
alex kahanek: hiden nang
donald trump: ['<nope>', '<nope>']
donald trump: corrupt for
joe biden: ['<prefix>', '<suffix>']
joe biden: cuban joe
barack obama: ['<nope>', '<nope>']
barack obama: dicky mcmuffin
this name is really long: ['<prefix>', '<suffix>']
this name is really long: sleepy creepy
-----------------------


limit: 3 --------
alex kahanek: ['<prefix>', '<suffix>', '<suffix>']
alex kahanek: flunky crime kahanek
donald trump: ['<prefix>', '<suffix>', '<suffix>']
donald trump: the from sham
joe biden: ['<prefix>', '<suffix>', '<suffix>']
joe biden: slimeball biden biden
barack obama: ['<nope>', '<nope>', '<nope>']
barack obama: 0% h the
this name is really long: ['<nope>', '<nope>', '<nope>']
this name is really long: dopey cnn flunky
-----------------------


limit: 4 --------
alex kahanek: ['<prefix>', '<suffix>', '<suffix>', '<suffix>']
alex kahanek: sick puppy a nancy
donald trump: ['<prefix>', '<suffix>', '<suffix

+ add length of nickname to vector?
+ figure out how to jump names, i.e. if name1 then name2
+ figure out how to stop guessing names, i.e. dictionary of possible words? pre, suff, nope dictionary?
+ probabilities for generated name length
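
The note about probabilities for the generated name length could be approached by counting how often each nickname length occurs in the data and sampling `gen_len` from those frequencies. This is only a rough sketch of that idea: `toy_nicknames` is a made-up sample, and in the notebook the values would presumably come from `raw['fake name']`.

```python
import random
from collections import Counter

# hypothetical sample of nicknames; the notebook would use raw['fake name']
toy_nicknames = ['dumbo', 'wheres hunter', 'basement joe', 'beijing joe', 'sleepy creepy joe']

# count how often each word length occurs
length_counts = Counter(len(name.split()) for name in toy_nicknames)
lengths = sorted(length_counts)
weights = [length_counts[n] / len(toy_nicknames) for n in lengths]
print(dict(zip(lengths, weights)))  # {1: 0.2, 2: 0.6, 3: 0.2}

# draw a generation length in proportion to those frequencies
gen_len = random.choices(lengths, weights=weights, k=1)[0]
print(gen_len)
```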