# Training a Neural Network to Generate Trump Nicknames

This is a word-based approach to nickname generation. We will walk through the preprocessing required to tokenize the nicknames, using the data set pulled from [this link here]. I cleaned and analyzed the data myself; see [this link here] for that analysis.

# Grabbing the data

In [14]:
import pandas as pd
import numpy as np

raw = pd.read_csv('cleaned.nicknames.csv')

raw.head()

| Unnamed: 0 | fake name | real name | len fake | len real | category | notes | count |
|---|---|---|---|---|---|---|---|
| 0 | dumbo | randolph tex alles | 1 | 3 | domestic political figures | director of the united states secret service | 1 |
| 1 | wheres hunter | hunter biden | 2 | 2 | domestic political figures | american lawyer and lobbyist who is the second... | 1 |
| 2 | 1% joe | joe biden | 2 | 2 | domestic political figures | 47th vice president of the united states; form... | 1 |
| 3 | basement joe | joe biden | 3 | 2 | domestic political figures | 47th vice president of the united states; form... | 1 |
| 4 | beijing joe | joe biden | 3 | 2 | domestic political figures | 47th vice president of the united states; form... | 1 |


Here is the data we are working with. We have the nicknames (`fake name`) and the corresponding real name of the person Trump gave each nickname to. There are a few other columns as well, but they do not matter for this task.

# Tokenizing the names

Here we need to separate each word and add a tag. We will use a few custom tags:

+ real names:
    - these will be given a tag for each word in the name, for example,
        + 'joe biden' will become 'joe <name1> biden <name2>'

+ nicknames:
    - these follow the rule that if a word from the real name appears in the nickname, it is replaced with the corresponding real-name tag.
        + i.e., `basement joe` will become `basement <name1>`
    - in addition to the name substitution, we add a `<prefix>` tag to every remaining word before the name, and a `<suffix>` tag to the words after the substitution.
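
To make the tagged format concrete, here is a minimal sketch of reading a tagged string back into (word, tag) pairs. The helper name `parse_tagged` is hypothetical and not part of the original notebook, and it assumes the string strictly alternates between a word and its tag, as in the real-name example above.

```python
def parse_tagged(tagged):
    """Split 'word <tag> word <tag> ...' into a list of (word, tag) pairs.

    Assumption: the string strictly alternates word, tag, word, tag.
    """
    parts = tagged.split()
    if len(parts) % 2 != 0:
        raise ValueError(f'expected alternating word/tag pairs, got: {tagged!r}')
    # even positions hold the words, odd positions hold their tags
    return list(zip(parts[0::2], parts[1::2]))


print(parse_tagged('joe <name1> biden <name2>'))
# [('joe', '<name1>'), ('biden', '<name2>')]
```

Keeping each word next to its tag in one flat string means a plain whitespace split is enough to recover the pairs later.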

We make these modifications so that the generated `<nameX>` tags can be filled in from the user input. For example, if a user inputs `Joe Biden` and the generated nickname follows the pattern `<prefix> <name1>`, the generation algorithm can substitute `joe` for `<name1>`, so all we need to generate is the word for the `<prefix>` slot. If the nickname contains no part of the real name, so that there is no prefix or suffix, every word is tagged `<nope>` instead.
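
The substitution step described above could look roughly like the sketch below. The function `fill_names` is a hypothetical helper, not code from the notebook. It assumes the generated text contains `<nameX>` tags with 1-based numbering and that the user's name is split on whitespace and lowercased to match the data set.

```python
import re

# matches tags such as <name1>, <name2>, ... and captures the number
NAME_TAG = re.compile(r'<name(\d+)>')


def fill_names(generated, realname):
    """Replace each <nameX> tag in `generated` with the X-th word of `realname`."""
    real_words = realname.lower().split()

    def lookup(match):
        index = int(match.group(1)) - 1  # tags are 1-based
        # leave the tag untouched if the real name has no word at that position
        if 0 <= index < len(real_words):
            return real_words[index]
        return match.group(0)

    return NAME_TAG.sub(lookup, generated)


print(fill_names('sleepy <name1>', 'Joe Biden'))  # sleepy joe
print(fill_names('<name1> hiden', 'Joe Biden'))   # joe hiden
```

With this split of responsibilities the model only has to produce the non-name words, while the name slots are filled deterministically from the input.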

In [18]:
def tokenize(realname, nickname):
    """Return a tuple (tagged real name, tagged nickname) for one row."""
    real_words = realname.split(' ')
    nick_words = nickname.split(' ')

    # map each word of the real name to its positional tag, e.g. joe -> <name1>
    real2token = {word: f'<name{i + 1}>' for i, word in enumerate(real_words)}
    # pair every real-name word with its tag and join into a single string
    real_tokenized = ' '.join([f'{word} {real2token[word]}' for word in real2token])

    # real-name words that occur in the nickname
    # (substring test on the whole nickname, so 'biden' also matches 'obiden')
    subs = [sub for sub in real_words if sub in nickname]

    if len(subs) == 0:
        # no part of the real name appears in the nickname: tag every word <nope>
        nick_tokenized = ' '.join([f'{word} <nope>' for word in nick_words])
        return (real_tokenized, nick_tokenized)

    # swap each nickname word that exactly equals a real-name word for its <nameX> tag
    substituted = ' '.join([real2token[word] if word in subs else word for word in nick_words])
    print(substituted)

    # words before the first name tag get <prefix>;
    # the first name tag and everything after it get <suffix>
    token = '<prefix>'
    tokenized = []

    for word in substituted.split(' '):
        if '<' in word and 'prefix' in token:
            token = '<suffix>'
        tokenized.append(f'{word} {token}')

    nick_tokenized = ' '.join(tokenized)
    print(nick_tokenized)

    # send back tuple of input, output
    return (real_tokenized, nick_tokenized)



tokenized_names = [tokenize(realname, nickname) for i, nickname, realname in raw[['fake name', 'real name']].itertuples()]

# tokenized_names[0:3]

wheres <name1>
wheres <prefix> <name1> <suffix>
1% <name1>
1% <prefix> <name1> <suffix>
basement <name1>
basement <prefix> <name1> <suffix>
beijing <name1>
beijing <prefix> <name1> <suffix>
china <name1>
china <prefix> <name1> <suffix>
corrupt <name1>
corrupt <prefix> <name1> <suffix>
crazy <name1>
crazy <prefix> <name1> <suffix>
quid pro <name1>
quid <prefix> pro <prefix> <name1> <suffix>
sleepy <name1>
sleepy <prefix> <name1> <suffix>
sleepy creepy <name1>
sleepy <prefix> creepy <prefix> <name1> <suffix>
slow <name1>
slow <prefix> <name1> <suffix>
<name1> hiden
<name1> <suffix> hiden <suffix>
obiden
obiden <prefix>
little <name1> <name2>
little <prefix> <name1> <suffix> <name2> <suffix>
mini mike <name2>
mini <prefix> mike <prefix> <name2> <suffix>
da nang <name1>
da <prefix> nang <prefix> <name1> <suffix>
gov <name1> moonbeam <name2>
gov <prefix> <name1> <suffix> moonbeam <suffix> <name2> <suffix>
<name4> original
<name4> <suffix> original <suffix>
low energy <name1>
low <prefix> en