<a href="https://colab.research.google.com/github/Ladvien/gan_name_maker/blob/master/deep_name_generator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Deep Name Generator

This project is a proof of concept showing that "organic" first names can be generated using a [Generative Adversarial Network](https://en.wikipedia.org/wiki/Generative_adversarial_network). It uses a found dataset provided by [Hadley Wickham](http://hadley.nz/) at RStudio.

The goal will be to vectorize each of the names in the following format:

| a_0 | b_0 | c_0 | ... | z_9 | a_10 | etc |
|-----|-----|-----|-----|-----|------|-----|
|  1  |  0  |  0  | ... |  1  |  0   |  0  |
|  0  |  0  |  1  | ... |  0  |  0   |  0  |

Each column name combines a character with its position in the string: the letter is the one-hot encoded character and the number is its zero-based position in the name.

For example, the name `Abby` would be represented with the following vector.

| a_0 | ... | b_1 | ... | b_2 | ... | y_3 |
|-----|-----|-----|-----|-----|-----|-----|
|  1  | ... |  1  | ... |  1  | ... |  1  |
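
The column that gets set for each character can be located with simple arithmetic. A minimal sketch, assuming the 27-character alphabet (`a` to `z` plus the `~` pad) and the `max_name_length` of 10 defined later in this notebook; `one_hot_indices` is a hypothetical helper name:

```python
# Assumed to mirror the notebook's parameters defined further down.
pad_character = '~'
allowed_chars = 'abcdefghijklmnopqrstuvwxyz' + pad_character
max_name_length = 10

def one_hot_indices(name):
    """Return the flat column index that is set to 1 for each position."""
    padded = name.lower().ljust(max_name_length, pad_character)[:max_name_length]
    return [
        position * len(allowed_chars) + allowed_chars.index(char)
        for position, char in enumerate(padded)
    ]

indices = one_hot_indices('Abby')

# Columns are ordered a_0, b_0, ..., ~_0, a_1, b_1, ...
labels = [f'{allowed_chars[i % 27]}_{i // 27}' for i in indices]
print(labels[:4])  # ['a_0', 'b_1', 'b_2', 'y_3']
```

The remaining six positions all point at the pad character's column, so every name sets exactly ten columns.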

Wickham's dataset also includes:

* `year`
* `percent_[popularity]`
* `sex`

It may be interesting to add these as additional features so the model can learn the context in which a first name is used.
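
One way such context could be appended to a name vector is sketched below. This is an illustration only: `add_context_features`, the `'boy'`/`'girl'` coding of `sex`, and the year bounds used for scaling are assumptions, not something this notebook implements.

```python
def add_context_features(name_vector, year, sex, percent,
                         min_year=1880, max_year=2008):
    """Append scaled context features to a one-hot name vector (a list)."""
    # Scale the year into [0, 1]; min_year and max_year are assumed bounds.
    year_scaled = (year - min_year) / (max_year - min_year)
    # Encode sex as a single binary flag (assumes the values 'boy' and 'girl').
    is_girl = 1.0 if sex == 'girl' else 0.0
    return list(name_vector) + [year_scaled, is_girl, percent]

example = add_context_features([1, 0, 0], year=1944, sex='girl', percent=0.05)
print(example)  # [1, 0, 0, 0.5, 1.0, 0.05]
```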


## Load the Data

In [0]:
import pandas as pd
import numpy as np

In [0]:
# Engineering parameters.
pad_character       = '~'
allowed_chars       = f'abcdefghijklmnopqrstuvwxyz{pad_character}'
len_allow_chars     = len(allowed_chars)
max_name_length     = 10 

templated_df = pd.DataFrame()

# Create the dataframe.
for i in range(max_name_length):
  for char in allowed_chars:
    templated_df[char + '_' + str(i)] = 0

# Show the first and last ten columns.
templated_df.columns.tolist()[0:10] + templated_df.columns.tolist()[-10:]

In [0]:
!git clone https://github.com/hadley/data-baby-names.git

## Examine the Data

In [0]:
df = pd.read_csv('/content/data-baby-names/baby-names.csv')

In [0]:
df.head()

### Name Popularity

In [0]:
df.sort_values(by = 'percent', ascending = False).head(10)

## Preparing Dataframe

## Vectorizing Names

In [0]:
names = df['name'].str.lower().unique()

In [0]:
num_unique_names = len(names)
print(f'Found total of {num_unique_names} unique names')

In [0]:
# TODO: This code could be made tons more performant by refactoring.
#       Right now, it's as slow as a snail on salt.

def vectorize_name(name, max_name_length, allowed_chars, pad_character):
  tmp = []

  # Standardize
  name = name.lower()

  # Pad the name if needed.
  while len(name) < max_name_length:
    name += pad_character

  feature_index = 0

  # Create the pandas series object.
  name_vector = pd.Series()
  
  # Loop through all placeholders
  for feature_index in range(max_name_length):
      # Loop through all allowed characters
      for allowed_char in allowed_chars:

          # Create a feature for each allowed character by its placeholder (e.g., "j_4")
          feature_name = allowed_char + '_' + str(feature_index)

          # If the name has a character in the placeholder, flag it as true.
          if name[feature_index] == allowed_char:
            name_vector[feature_name] = 1
          else:
            name_vector[feature_name] = 0
  return name_vector

name_vector = vectorize_name('adam', max_name_length, allowed_chars, pad_character)

In [0]:
name_vector
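
Regarding the TODO above, one possible faster approach is to write straight into a preallocated NumPy array instead of growing a `pd.Series` one label at a time. A minimal sketch, assuming the same parameters as above; `vectorize_name_fast` and `char_to_index` are hypothetical names, and the function returns a plain array rather than a labeled series:

```python
import numpy as np

pad_character = '~'
allowed_chars = 'abcdefghijklmnopqrstuvwxyz' + pad_character
max_name_length = 10

# Map each character to its column offset within one position's block.
char_to_index = {char: i for i, char in enumerate(allowed_chars)}

def vectorize_name_fast(name):
    """Return a flat 0/1 array laid out like vectorize_name's output."""
    padded = name.lower().ljust(max_name_length, pad_character)[:max_name_length]
    vector = np.zeros(max_name_length * len(allowed_chars), dtype=np.int8)
    for position, char in enumerate(padded):
        # Characters outside allowed_chars leave their block all zeros,
        # matching the behaviour of the loop-based version.
        if char in char_to_index:
            vector[position * len(allowed_chars) + char_to_index[char]] = 1
    return vector

vector = vectorize_name_fast('adam')
print(vector.shape, vector.sum())  # (270,) 10
```

Stacking these arrays with `np.stack` and wrapping the result in a dataframe once would avoid most of the per-name pandas overhead.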

In [0]:
del df

# Vectorize every name, collecting the rows in a list. Building the dataframe
# once at the end is much faster than appending row by row, and
# `DataFrame.append` was removed in pandas 2.0.
rows = []
for name in names:
  name_vector = vectorize_name(name, max_name_length, allowed_chars, pad_character)
  name_vector['name'] = name
  rows.append(name_vector)

df = pd.DataFrame(rows)


In [0]:
df.to_csv('vectorized_names.csv')

# GAN
Working off the following Keras GAN example:
https://towardsdatascience.com/gan-by-example-using-keras-on-tensorflow-backend-1a6d515a60d0

Another good article on GANs and text generation:
https://becominghuman.ai/generative-adversarial-networks-for-text-generation-part-1-2b886c8cab10

And a good one on transformers (Attention Is All You Need):
https://medium.com/inside-machine-learning/what-is-a-transformer-d07dd1fbec04

Two more references, on building a GAN with Keras and on GAN failure modes:
https://medium.com/datadriveninvestor/generative-adversarial-network-gan-using-keras-ce1c05cfdfd3

https://machinelearningmastery.com/practical-guide-to-gan-failure-modes/

# Prepared Data
If you'd like to skip to the fun part, I've vectorized the names already.

In [0]:
!git clone https://github.com/Ladvien/gan_name_maker

In [0]:
import pandas as pd
import numpy as np

In [0]:
df = pd.read_csv('./gan_name_maker/vectorized_names.csv')
cols = list(df)

# Move the name column to the beginning.
cols.insert(0, cols.pop(cols.index('name')))
df = df.loc[:, cols]

# Drop the yucky columns.
df.drop('Unnamed: 0', axis = 1, inplace = True)

# Sort values by name
df.sort_values(by = 'name', ascending = True, inplace = True)

In [0]:
print(f'Vectorized data has {df.shape[0]} samples and {df.shape[1]} features.')

In [0]:
df.head()

# Libraries

In [0]:
import tensorflow as tf

from tensorflow.keras.layers import Dense, Dropout, Activation, Input
from tensorflow.keras import Sequential
from tensorflow.keras.models import Model

from tensorflow.keras.callbacks import History 

# Personal tools.
!pip install git+https://github.com/Ladvien/ladvien_ml.git
from ladvien_ml import FeatureModel

fm = FeatureModel()

# Setup Weights and Biases

In [0]:
!pip install --upgrade wandb

In [0]:
# Log in interactively and paste your own API key at the prompt rather than
# hard-coding a key in the notebook.
!wandb login
import wandb

# Training Parameters

In [0]:
# Parameters
optimizer_name        = 'adadelta'
learning_rate         = 1.0
epochs                = 1000
batch_size            = 32
num_samples           = 10
dropout               = 0.2
generator_inputs      = 800
width_modifier        = 0.5 # Multiplied by inputs to give neurons in dense.

# Input shape will be the number of possible characters times 
# the maximum name length allowed.
vectorized_name_length = len_allow_chars * max_name_length

params = {
    'epoch': epochs,
    'batch_size': batch_size,
    'optimizer_name': optimizer_name,
    'generator_inputs': generator_inputs,
    'num_samples_per_step': num_samples,
    'allowed_chars': allowed_chars,
    'max_name_length': max_name_length,
    'dropout': dropout,
    'width_modifier': width_modifier
}
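
To make the sizes concrete, the arithmetic behind these parameters can be worked through directly. A small sketch, assuming the 27-character alphabet and `max_name_length` of 10 defined at the top of the notebook (the `discriminator_hidden` and `generator_hidden` names are only for illustration):

```python
len_allow_chars  = 27    # 'a'..'z' plus the pad character
max_name_length  = 10
generator_inputs = 800
width_modifier   = 0.5

# Each name is a flat vector with one slot per (position, character) pair.
vectorized_name_length = len_allow_chars * max_name_length

# Hidden-layer widths are the input width scaled by width_modifier.
# Dense layers need a whole number of units, hence the int() cast.
discriminator_hidden = int(vectorized_name_length * width_modifier)
generator_hidden     = int(generator_inputs * width_modifier)

print(vectorized_name_length, discriminator_hidden, generator_hidden)
# 270 135 400
```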

In [0]:
wandb.init(project = 'deep_name_generator',
           config = params)

# Discriminator

In [0]:
def discriminator(input_shape, optimizer, dropout = 0.1, width_modifier = 0.5):
  
  D = Sequential()
  
  # Input layer
  input_layer_width = input_shape
  D.add(Dense(input_layer_width, input_shape = (input_layer_width,), activation = 'relu'))
  D.add(Dropout(dropout))
  
  # First Hidden Layer
  # Cast widths to int: Dense expects a whole number of units.
  first_layer_width = int(input_shape * width_modifier)
  D.add(Dense(first_layer_width, activation = 'relu'))
  D.add(Dropout(dropout))

  # Second Hidden Layer
  second_layer_width = int(input_shape * width_modifier)
  D.add(Dense(second_layer_width, activation = 'relu'))
  D.add(Dropout(dropout))

  # Third Hidden Layer
  third_layer_width = int(input_shape * width_modifier)
  D.add(Dense(third_layer_width, activation = 'relu'))
  D.add(Dropout(dropout))
 
  # Output
  D.add(Dense(1, activation = 'sigmoid'))
  D._name = 'discriminator'
  D.compile(optimizer = optimizer, loss = 'binary_crossentropy', metrics = ['accuracy'])
  D.summary()

  return D
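
The discriminator is compiled with `binary_crossentropy`, which penalizes confident wrong answers heavily. A plain-Python sketch of that loss for a batch of predictions follows; the helper name and the `eps` clipping value are illustrative choices, not Keras internals:

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over a batch of (label, probability) pairs."""
    total = 0.0
    for label, prob in zip(y_true, y_pred):
        # Clip to avoid log(0) when a prediction is exactly 0 or 1.
        prob = min(max(prob, eps), 1 - eps)
        total += -(label * math.log(prob) + (1 - label) * math.log(1 - prob))
    return total / len(y_true)

# Labels: 1 = real name, 0 = generated name.
confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
print(confident_right < confident_wrong)  # True
```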

# Generator

In [0]:
def generator(num_inputs, output_shape, optimizer, dropout = 0.1, width_modifier = 0.5):

  G = Sequential()

  # Input layer
  input_layer_width = num_inputs
  G.add(Dense(input_layer_width, input_shape = (input_layer_width,), activation = 'relu'))
  G.add(Dropout(dropout))
  
  # First Hidden Layer
  # Cast widths to int: Dense expects a whole number of units.
  first_layer_width = int(num_inputs * width_modifier)
  G.add(Dense(first_layer_width, activation = 'relu'))
  G.add(Dropout(dropout))

  # Second Hidden Layer
  second_layer_width = int(num_inputs * width_modifier)
  G.add(Dense(second_layer_width, activation = 'relu'))
  G.add(Dropout(dropout))

  # Third Hidden Layer
  third_layer_width = int(num_inputs * width_modifier)
  G.add(Dense(third_layer_width, activation = 'relu'))
  G.add(Dropout(dropout))
  
  # Output layer 
  G.add(Dense(output_shape, activation = 'sigmoid'))
  G._name = 'generator'
  G.compile(optimizer = optimizer, loss = 'categorical_crossentropy')
  G.summary()
  return G

# GAN

In [0]:
def create_gan(D, G, g_inputs):
    D.trainable = False
    gan_input = Input(shape = (g_inputs,))
    x = G(gan_input)
    gan_output = D(x)
    gan = Model(inputs = gan_input, outputs = gan_output)
    gan.compile(loss = 'binary_crossentropy', optimizer = 'rmsprop')
    return gan

# Compile



In [0]:
# Select optimizer.
optimizer = fm.select_optimizer(optimizer_name, learning_rate)

# Generator
G = generator(generator_inputs, vectorized_name_length, optimizer, dropout = dropout, width_modifier = width_modifier)

# Discriminator
D = discriminator(vectorized_name_length, optimizer, dropout = dropout, width_modifier = width_modifier)

# Build GAN
GAN = create_gan(D, G, generator_inputs)

GAN._name = 'GAN'
GAN.compile(loss='binary_crossentropy', optimizer = optimizer, metrics=['accuracy'])
GAN.summary()

## Prepare Data

In [0]:
# Randomize inputs.
df = df.sample(df.shape[0])

# Create target label.
df['real'] = 1

# Drop the 'name' and 'real' columns.
X = df.iloc[:,1:-1]

# Get target
y = df.iloc[:,-1:]

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

# Evaluation Method

In [0]:
def retrieve_names_from_sparse_matrix(generated_names, pad_character):
  retrieved_names = []
  for name_index in range(len(generated_names)):
    generated_name = ''
    name_array = generated_names[name_index]
    for char_index in range(max_name_length):
      
      # Slice out the block of character scores for this position.
      first_letter_index = (char_index * len_allow_chars)
      last_letter_index = (char_index * len_allow_chars + len_allow_chars)
      char_vector = list(name_array[first_letter_index:last_letter_index])
      
      char = allowed_chars[char_vector.index(max(char_vector))]

      if char == pad_character:
        break
        
      generated_name += char

    # print(generated_name) 
    retrieved_names.append(generated_name)

  return retrieved_names
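
The same decoding can be expressed with a reshape and `argmax`. A sketch under the notebook's parameters; `decode_names` is a hypothetical alternative, and the "generator output" here is a hand-built array rather than real model output:

```python
import numpy as np

pad_character = '~'
allowed_chars = 'abcdefghijklmnopqrstuvwxyz' + pad_character
max_name_length = 10

def decode_names(generated_names):
    """Turn rows of per-character scores back into strings."""
    scores = np.asarray(generated_names)
    # One row per name, one block of character scores per position.
    scores = scores.reshape(len(scores), max_name_length, len(allowed_chars))
    names = []
    for char_indices in scores.argmax(axis=2):
        name = ''.join(allowed_chars[i] for i in char_indices)
        # Keep everything before the first pad character.
        names.append(name.split(pad_character)[0])
    return names

# Build a noisy stand-in output whose strongest scores spell 'adam'.
rng = np.random.default_rng(0)
fake = rng.uniform(0.0, 0.4, size=(1, max_name_length * len(allowed_chars)))
for position, char in enumerate('adam'.ljust(max_name_length, pad_character)):
    fake[0, position * len(allowed_chars) + allowed_chars.index(char)] = 0.9

print(decode_names(fake))  # ['adam']
```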

# Training

In [0]:
# Number of batches needed to cover the training set once.
batch_count = x_train.shape[0] // batch_size

for e in range(1, epochs + 1):
    print(f'Epoch: {e}')
    for step in range(batch_count):
        # Generate noise as input to initialize generator.
        noise = np.random.normal(0, 1, [batch_size, generator_inputs])
        
        # Generate fake names from noise.
        generated_names = G.predict(noise)
        
        # Get a random set of real names.
        real_names = x_train.iloc[np.random.randint(low = 0, high = x_train.shape[0], size = batch_size),:]

        # Stack real and fake names into a single discriminator batch.
        X = np.concatenate([real_names, generated_names])
        
        # Labels for real and generated data (the first batch_size rows are real).
        y_labels = np.zeros(2 * batch_size)
        y_labels[:batch_size] = 1
        
        # Train the discriminator on real and fake data before the GAN step.
        D.trainable = True
        D.train_on_batch(X, y_labels)
        
        # Label the generated names as real so the generator is trained to fool the discriminator.
        noise = np.random.normal(0, 1, [batch_size, generator_inputs])
        y_gen = np.ones(batch_size)
        
        # During the training of GAN, the weights of discriminator should be 
        # fixed. We can enforce that by setting the trainable flag.
        D.trainable = False
        
        # Train the GAN by alternating the training of the Discriminator 
        # and training the chained GAN model with Discriminator’s weights 
        # frozen.
        GAN_score = GAN.train_on_batch(noise, y_gen)
        print(f'GAN loss: {GAN_score[0]}')
        D_score = D.evaluate(X, y_labels)
        print(f'Disc. loss: {D_score}')

    # Log to Weights and Biases
    wandb.log({'GAN Loss': GAN_score[0], 
              'epoch': e, 
              'discriminator_loss': D_score[0],
              'discriminator_accuracy': D_score[1]
    })

    # Make Generator inputs.
    noise = np.random.normal(0, 1, [num_samples, generator_inputs])

    # Generate fake names from noise.
    generated_names = G.predict(noise)
    retrieved_names = retrieve_names_from_sparse_matrix(generated_names, pad_character)

    # Save generated names.
    table = wandb.Table(columns=['Name', 'Epoch'])

    for name in retrieved_names:
      table.add_data(name, e)
    wandb.log({"generated_names": table})
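
The discriminator step in the loop above hinges on how the real and generated rows are stacked and labeled. A NumPy-only sketch of that bookkeeping, using random placeholder arrays in place of `x_train` and `G.predict(noise)`:

```python
import numpy as np

batch_size = 4
vectorized_name_length = 270

rng = np.random.default_rng(42)
# Placeholders: real rows are 0/1 vectors, generated rows are sigmoid outputs.
real_names = rng.integers(0, 2, size=(batch_size, vectorized_name_length))
generated_names = rng.uniform(0, 1, size=(batch_size, vectorized_name_length))

# Real rows first, generated rows second.
X = np.concatenate([real_names, generated_names])

# Label the first batch_size rows 1 (real) and the rest 0 (fake).
y_labels = np.zeros(2 * batch_size)
y_labels[:batch_size] = 1

# The generator step uses all-ones labels: it is rewarded when the
# discriminator mistakes generated names for real ones.
y_gen = np.ones(batch_size)

print(X.shape, y_labels)  # (8, 270) [1. 1. 1. 1. 0. 0. 0. 0.]
```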
          

In [0]:
# Make Generator inputs.
noise = np.random.normal(0, 1, [20, generator_inputs])

# Generate fake names from noise.
generated_names = G.predict(noise)
retrieve_names_from_sparse_matrix(generated_names, pad_character)