# Pokémon Name Generation with Keras

Generate new, unique Pokémon names with an LSTM, using Andrej Karpathy's famous [Char-RNN](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) approach, which he used to generate poetry. There is more information in the blog post, but the concept is fairly simple. We want to build a next-character-in-text predictor. We do this by using a window of fixed length as our input and the next char as output, and then train an LSTM to perform this task. Since the network cannot work with raw characters, we encode each character as a one-hot character vector.
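
As a rough illustration of the windowing idea, here is a minimal sketch on a toy corpus. The helper `make_windows` and the two toy names are made up for this example and are not part of the notebook; the real corpus and window length are built further down.

```python
def make_windows(corpus, window, step=1):
    """Return (input window, next char) pairs taken from corpus."""
    pairs = []
    for start in range(0, len(corpus) - window, step):
        pairs.append((corpus[start:start + window], corpus[start + window]))
    return pairs

# Hypothetical toy corpus: two names separated by a newline
toy_corpus = "abra\nmew"

for window_text, next_char in make_windows(toy_corpus, window=4):
    print(repr(window_text), '->', repr(next_char))
```

Each pair is one training sample: the window is the input and the char right after it is the label. The newline is treated like any other char, which is how the network can learn where names end.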

In [1]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [2]:
!pip install keras

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [3]:
import pandas as pd
import numpy as np
import keras
import time
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from tensorflow.keras import optimizers
import random
import os

## Settings

In [4]:
step_length = 1    # The step length we take to get our samples from our corpus
epochs = 100       # Number of times we train on our full data
batch_size = 32    # Data samples in each training step
latent_dim = 256    # Size of our LSTM
dropout_rate = 0.5 # Regularization with dropout
model_path = os.path.realpath('/content/drive/MyDrive/Colab Notebooks/pokemon-name-generator/output/poke_gen_model.h5') # Location for the model
load_model = True # Enable loading model from disk
store_model = True # Store model to disk after training
verbosity = 1      # Print result for each epoch
gen_amount = 10    # How many names to generate

## Loading data

This version reads HTML tables saved from Bulbapedia. Preprocessing lowercases the names and types and removes every character that is not in a-z.
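
To make the effect of that cleaning step concrete, here is a small standalone sketch using the same `[^a-z]` pattern as the cell below. The helper `clean_name` and the example inputs are only for illustration; the notebook applies the pattern to the whole DataFrame instead.

```python
import re

def clean_name(name):
    """Lowercase a name and drop every character outside a-z."""
    return re.sub(r'[^a-z]', '', name.lower())

# Illustrative inputs; \u00e9 is 'é' and \u2640 is the female sign
examples = ["Mr. Mime", "Farfetch'd", "Nidoran\u2640", "Flab\u00e9b\u00e9"]

for raw in examples:
    print(raw, '->', clean_name(raw))
```

Note that accented letters are removed rather than mapped to their plain form, so a name containing 'é' loses those letters entirely.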

In [5]:
# Input paths for the individual tables
input_path_1 = os.path.realpath('/content/drive/MyDrive/Colab Notebooks/pokemon-name-generator/input/bulbapedia_gen_1.html')
input_path_2 = os.path.realpath('/content/drive/MyDrive/Colab Notebooks/pokemon-name-generator/input/bulbapedia_gen_2.html')
input_path_3 = os.path.realpath('/content/drive/MyDrive/Colab Notebooks/pokemon-name-generator/input/bulbapedia_gen_3.html')
input_path_4 = os.path.realpath('/content/drive/MyDrive/Colab Notebooks/pokemon-name-generator/input/bulbapedia_gen_4.html')
input_path_5 = os.path.realpath('/content/drive/MyDrive/Colab Notebooks/pokemon-name-generator/input/bulbapedia_gen_5.html')
input_path_6 = os.path.realpath('/content/drive/MyDrive/Colab Notebooks/pokemon-name-generator/input/bulbapedia_gen_6.html')
input_path_7 = os.path.realpath('/content/drive/MyDrive/Colab Notebooks/pokemon-name-generator/input/bulbapedia_gen_7.html')
input_path_8 = os.path.realpath('/content/drive/MyDrive/Colab Notebooks/pokemon-name-generator/input/bulbapedia_gen_8.html')
input_path_9 = os.path.realpath('/content/drive/MyDrive/Colab Notebooks/pokemon-name-generator/input/bulbapedia_gen_9.html')
# Construct a df of all files.
# 'PokÃ©mon' is the column name exactly as it comes out of read_html here; it is renamed below.
columns = ['PokÃ©mon', 'Type', 'Type.1']
# input_path_9 is deliberately left out
input_paths = [input_path_1, input_path_2, input_path_3, input_path_4,
               input_path_5, input_path_6, input_path_7, input_path_8]
# DataFrame.append was removed in pandas 2.0, so the tables are combined with pd.concat
df = pd.concat([pd.read_html(path)[0][columns] for path in input_paths], ignore_index=True)

df.rename(columns = {'PokÃ©mon':'Pokemon'}, inplace=True)
df['Pokemon'] = df['Pokemon'].str.lower()
df['Type'] = df['Type'].str.lower()
df['Type.1'] = df['Type.1'].str.lower()
df=df.replace(r'[^a-z]','', regex=True)

dummy_types = pd.get_dummies(df, prefix='', prefix_sep='', columns=['Type'])
dummy_types_1 = pd.get_dummies(df, prefix='', prefix_sep='', columns=['Type.1'])
dummy_types.drop(['Pokemon', 'Type.1'], axis=1, inplace=True)
dummy_types_1.drop(['Pokemon', 'Type'], axis=1, inplace=True)
dummy_types = dummy_types.add(dummy_types_1, fill_value=0).astype(int)
df.drop(['Type', 'Type.1'], axis=1, inplace=True)
df = df.join(dummy_types)

In [6]:
df

Unnamed: 0,Pokemon,bug,dark,dragon,electric,fairy,fighting,fire,flying,ghost,grass,ground,ice,normal,poison,psychic,rock,steel,water
0,bulbasaur,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0
1,ivysaur,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0
2,venusaur,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0
3,charmander,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0
4,charmeleon,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
971,ursaluna,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0
972,basculegion,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1
973,sneasler,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0
974,overqwil,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0


## Preprocessing
- Concatenate all Pokémon names into a long string corpus.
- Build dictionaries to translate chars to indices in a binary char vector.
- Choose a suitable sequence window; here it is the length of the longest name.

In [7]:
sample_df = df.sample(frac=1, random_state=1).copy()
input_names = sample_df['Pokemon']

# Make it all to a long string
concat_names = '\n'.join(input_names).lower()
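concat_types = []
type_rows = sample_df.values.tolist()  # Convert once instead of once per character
name_idx = 0
for c in concat_names:
  if c == '\n':
    name_idx += 1
    concat_types.append([0]*len(concat_types[0]))  # Separators carry no type
  else:
    concat_types.append(type_rows[name_idx][1:])  # Skip the name column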
# Find all unique characters by using set()
chars = sorted(list(set(concat_names)))
num_chars = len(chars)

# Build translation dictionaries, 'a' -> 0, 0 -> 'a'
char2idx = dict((c, i) for i, c in enumerate(chars))
idx2char = dict((i, c) for i, c in enumerate(chars))

# Use longest name length as our sequence window
max_sequence_length = max([len(name) for name in input_names])

print('Total chars: {}'.format(num_chars))
print('Corpus length names:', len(concat_names))
print('Corpus length types: ', len(concat_types))
print('Number of names: ', len(input_names))
print('Longest name: ', max_sequence_length)

Total chars: 27
Corpus length names: 8283
Corpus length types:  8283
Number of names:  976
Longest name:  12


Build a training set: each sample is a window of `max_sequence_length` chars as input, with the char that follows it as label.

In [8]:
types = []
sequences = []
next_chars = []

# Loop over our data and extract pairs of sequences and next chars
for i in range(0, len(concat_names) - max_sequence_length, step_length):
    types.append(concat_types[i: i + max_sequence_length])
    sequences.append(concat_names[i: i + max_sequence_length])
    next_chars.append(concat_names[i + max_sequence_length])

num_sequences = len(sequences)

print('Number of sequences:', num_sequences)
print('First 10 sequences and next chars:')
for i in range(10):
    print('X=[{}] \t y=[{}] \t first_char_type%={}'.format(sequences[i], next_chars[i], str(types[i][0])).replace('\n', ' '))

Number of sequences: 8271
First 10 sequences and next chars:
X=[cursola litl] 	 y=[e] 	 first_char_type%=[0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0]
X=[ursola litle] 	 y=[o] 	 first_char_type%=[0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0]
X=[rsola litleo] 	 y=[ ] 	 first_char_type%=[0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0]
X=[sola litleo ] 	 y=[r] 	 first_char_type%=[0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0]
X=[ola litleo r] 	 y=[o] 	 first_char_type%=[0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0]
X=[la litleo ro] 	 y=[t] 	 first_char_type%=[0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0]
X=[a litleo rot] 	 y=[o] 	 first_char_type%=[0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0]
X=[ litleo roto] 	 y=[m] 	 first_char_type%=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
X=[litleo rotom] 	 y=[ ] 	 first_char_type%=[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
X=[itleo rotom ] 	 y=[s] 	 first_char_type%=[0, 

One-hot encode our data into char vectors by using the translation dictionaries from earlier. The type flags belonging to each char are appended to its char vector.

#### Example

- 'a'   => [1, 0, 0, ..., 0]

- 'b'   => [0, 1, 0, ..., 0]

- 'c'   => [0, 0, 1, ..., 0]

- 'abc' => [[1, 0, 0, ..., 0], [0, 1, 0, ..., 0], [0, 0, 1, ..., 0]] 
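
A minimal sketch of this encoding on a toy vocabulary follows. The names `toy_chars`, `toy_char2idx` and `one_hot` are made up for this example. Note that with a sorted vocabulary the newline sorts before the letters, so 'a' does not land on index 0 here, and presumably not in the notebook's vocabulary either.

```python
import numpy as np

# Toy vocabulary: sorting puts '\n' first, then 'a', 'b', 'c'
toy_chars = sorted(set("abc\n"))
toy_char2idx = {c: i for i, c in enumerate(toy_chars)}

def one_hot(text, char2idx):
    """Return a bool array of shape (len(text), vocabulary size)."""
    encoded = np.zeros((len(text), len(char2idx)), dtype=bool)
    for position, char in enumerate(text):
        encoded[position, char2idx[char]] = True
    return encoded

print(toy_char2idx)
print(one_hot("abc", toy_char2idx).astype(int))
```

Every row contains exactly one 1, at the index of the char it represents.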

In [9]:
X = np.zeros((num_sequences, max_sequence_length, num_chars+len(concat_types[0])), dtype=bool)
Y = np.zeros((num_sequences, num_chars), dtype=bool)

for i, sequence in enumerate(sequences):
  for j, char in enumerate(sequence):
    X[i, j, char2idx[char]] = 1
  for j in range(max_sequence_length):
    for k in range(len(types[0][0])):
      X[i, j, num_chars+k] = types[i][j][k]>0
  Y[i, char2idx[next_chars[i]]] = 1

    
print('X shape: {}'.format(X.shape))
print('Y shape: {}'.format(Y.shape))

X shape: (8271, 12, 45)
Y shape: (8271, 27)


## Build model

Build a standard LSTM network with: 

- Input shape: (max_sequence_length x (num_chars + number of types)) - representing our sequences with the type flags appended.
- Output shape: num_chars - representing the next char coming after each sequence.
- Output activation: Softmax - since only one value should be 1 in output char vector.
- Loss: Categorical cross-entropy - standard loss for multi-class classification.

In [10]:
model = Sequential()
model.add(LSTM(latent_dim, 
               input_shape=(max_sequence_length, num_chars+len(concat_types[0])),  
               recurrent_dropout=dropout_rate))
model.add(Dense(units=num_chars, activation='softmax'))

optimizer = optimizers.RMSprop(learning_rate=0.01)  # 'lr' is the deprecated name of this argument
model.compile(loss='categorical_crossentropy',
              optimizer=optimizer)

model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm (LSTM)                 (None, 256)               309248    
                                                                 
 dense (Dense)               (None, 27)                6939      
                                                                 
Total params: 316,187
Trainable params: 316,187
Non-trainable params: 0
_________________________________________________________________
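
The parameter counts in the summary can be checked by hand. The sketch below uses the usual parameter formulas for a standard LSTM layer and a dense layer; the two helper functions are written for this example, and the input size 45 is the 27 chars plus the 18 type flags seen in the shape of `X`.

```python
def lstm_param_count(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units) * units + units)

def dense_param_count(input_dim, units):
    # One weight per input/output pair plus one bias per output
    return input_dim * units + units

lstm_params = lstm_param_count(input_dim=27 + 18, units=256)
dense_params = dense_param_count(input_dim=256, units=27)

print('LSTM: ', lstm_params)
print('Dense:', dense_params)
print('Total:', lstm_params + dense_params)
```

The three values agree with the 309248, 6939 and 316,187 reported by `model.summary()`.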




## Training

Watching the loss, doing cross-validation and all that good stuff is not that important here. The best model will not be found by optimizing some metric. We just want to strike a balance between a model that only outputs gibberish like 'sadsdaddddd' and a model that memorizes the names it was trained on. For this it is better to inspect the output and judge from that.

In [11]:
if load_model:
    model.load_weights(model_path)
else:
    
    start = time.time()
    print('Start training for {} epochs'.format(epochs))
    history = model.fit(X, Y, epochs=epochs, batch_size=batch_size, verbose=verbosity)
    end = time.time()
    print('Finished training - time elapsed:', (end - start)/60, 'min')
    
if store_model:
    print('Storing model at:', model_path)
    model.save(model_path)

Storing model at: /content/drive/MyDrive/Colab Notebooks/pokemon-name-generator/output/poke_gen_model.h5


## Generation

Generate names by starting with a real sequence from the corpus and repeatedly predicting the next char while updating the sequence. To get diversity, the next char is sampled from the probability distribution predicted by the model instead of always taking the most likely char. Diversity can be tuned further with something called temperature, which I didn't use here.
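
For reference, temperature could look roughly like the sketch below. This is not part of the notebook: `apply_temperature` is a hypothetical helper and `toy_probs` stands in for the model's predicted distribution. The probabilities are rescaled in log space before sampling, so values below 1 make the sampling more conservative and values above 1 make it more random.

```python
import numpy as np

def apply_temperature(probs, temperature=1.0):
    """Rescale a probability vector: <1 sharpens it, >1 flattens it."""
    probs = np.asarray(probs, dtype=np.float64)
    logits = np.log(probs + 1e-12) / temperature  # small constant avoids log(0)
    logits -= logits.max()                        # for numerical stability
    scaled = np.exp(logits)
    return scaled / scaled.sum()

# Stand-in for a predicted distribution over three chars
toy_probs = [0.7, 0.2, 0.1]

for temperature in (0.5, 1.0, 2.0):
    print(temperature, apply_temperature(toy_probs, temperature).round(3))
```

In the generation loop, the rescaled vector would be passed to `np.random.choice` in place of the raw `probs`.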

I also added some postprocessing to manually remove things I did not like. Some of this could possibly be done by tweaking the network, but I was happy with the way the names looked overall.

In [12]:
# Start sequence generation from end of input sequence
sequence = concat_names[-(max_sequence_length - 1):] + '\n'
type_sequence = concat_types[-(max_sequence_length - 1):]
type_sequence.append([0]*len(concat_types[0]))

desired_type = [1,  #Bug
                0,  #Dark
                0,  #Dragon
                0,  #Electric
                0,  #Fairy
                0,  #Fighting
                0,  #Fire
                0,  #Flying
                0,  #Ghost
                0,  #Grass
                0,  #Ground
                0,  #Ice
                0,  #Normal
                0,  #Poison
                0,  #Psychic
                0,  #Rock
                0,  #Steel
                0]  #Water

new_names = []

print('{} new names are being generated'.format(gen_amount))

while len(new_names) < gen_amount:
    
    # Vectorize sequence for prediction
    x = np.zeros((1, max_sequence_length, num_chars+len(desired_type)))
    for i, char in enumerate(sequence):
      x[0, i, char2idx[char]] = 1
    for i in range(max_sequence_length):
      for j in range(len(desired_type)):
        x[0, i, num_chars+j] = type_sequence[i][j]

    # Sample next char from predicted probabilities
    probs = model.predict(x, verbose=0)[0]
    probs /= probs.sum()
    next_idx = np.random.choice(len(probs), p=probs)   
    next_char = idx2char[next_idx]   
    sequence = sequence[1:] + next_char
    type_sequence = type_sequence[1:] + [desired_type]

    # New line means we have a new name
    if next_char == '\n':
        type_sequence[-1] = [0]*len(concat_types[0])
        gen_name = [name for name in sequence.split('\n')][1]

        # Never start name with two identical chars, could probably also
        if len(gen_name) > 2 and gen_name[0] == gen_name[1]:
            gen_name = gen_name[1:]

        # Discard all names that are too short
        if len(gen_name) > 2:
            
            # Only allow new and unique names
            if not (gen_name in input_names.to_list() or gen_name.capitalize() in new_names):
                new_names.append(gen_name.capitalize())

        if 0 == (len(new_names) % (gen_amount/ 10)):
            print('Generated {}'.format(len(new_names)))

10 new names are being generated
Generated 1
Generated 2
Generated 3
Generated 4
Generated 5
Generated 6
Generated 7
Generated 8
Generated 9
Generated 10


## Results

Here are the results. I personally cannot tell the difference between generated names and names of Pokémon I don't know. Sometimes there are giveaways, but overall the names are convincing and diverse!

In [13]:
print_first_n = min(10, gen_amount)

print('First {} generated names:'.format(print_first_n))
for name in new_names[:print_first_n]:
    print(name)

First 10 generated names:
Zotzzle
Zagastlea
Coscytthor
Buttle
Wormadar
Silcom
Anumisk
Sicvelipe
Derescoon
Ildurrmot


Storing the results

In [15]:
concat_output = '\n'.join(sorted(new_names))
output_path = os.path.realpath('/content/drive/MyDrive/Colab Notebooks/pokemon-name-generator/output/generated_names.txt')

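with open(output_path, 'a') as f:
    # The file is opened in append mode, so end with a newline to keep
    # the names of separate runs from running together
    f.write(concat_output + '\n')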