<a href="https://colab.research.google.com/github/Ladvien/gan_name_maker/blob/master/deep_name_prep_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Deep Name Generator

This project is a proof of concept showing that "organic" first names can be generated using a [Generative Adversarial Network](https://en.wikipedia.org/wiki/Generative_adversarial_network). We use a found dataset provided by [Hadley Wickham](http://hadley.nz/) at RStudio.

The goal will be to vectorize each of the names in the following format:

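| a_0 | b_0 | c_0 | ... | y_9 | z_9 | ~_9 |
|-----|-----|-----|-----|-----|-----|-----|
|  1  |  0  |  0  | ... |  0  |  1  |  0  |
|  0  |  0  |  1  | ... |  0  |  0  |  1  |

Each column name pairs a character with a position: the letter is the one-hot encoded character and the number is its zero-based position in the name. With a maximum name length of 10, positions run from 0 to 9, and `~` is the padding character that fills the positions past the end of a shorter name.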

For example, the name `Abby` would be represented with the following vector.

| a_0 | ... | b_1 | ... | b_2 | ... | y_3 |
|-----|-----|-----|-----|-----|-----|-----|
|  1  | ... |  1  | ... |  1  | ... |  1  |
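
As a quick check of this layout, the sketch below lists the columns that would be set to 1 for a given name. It assumes the settings used later in this notebook (a fixed length of 10 and `~` as the padding character); `active_features` is a hypothetical helper name, not part of the notebook.

```python
# Hypothetical helper: list the one-hot columns that are "on" for a name.
# Assumes a fixed length of 10 and '~' as the padding character.
def active_features(name, max_name_length=10, pad_character='~'):
    padded = name.lower().ljust(max_name_length, pad_character)[:max_name_length]
    # Exactly one active column per position: "<character>_<position>".
    return [f'{char}_{position}' for position, char in enumerate(padded)]

print(active_features('Abby'))
```

The first four entries correspond to the table above; the remaining six are the padding columns for positions 4 through 9.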

Wickham's dataset also includes:

* `year`
* `percent` (popularity)
* `sex`

It may be interesting to add these as additional features so the model can learn the context in which a first name is used.
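
One possible way to turn these columns into numeric features is sketched below. This is only an illustration: `context_features` is a hypothetical helper, the `min_year` and `max_year` defaults are assumptions that should be read from the actual data, and the `'girl'` label is assumed to be the counterpart of the `'boy'` value.

```python
# Hypothetical sketch: turn year, percent, and sex into extra numeric features.
# min_year and max_year are assumed bounds; derive them from the real data.
def context_features(year, percent, sex, min_year=1880, max_year=2008):
    return {
        # Scale the year into the range 0.0 to 1.0.
        'year_scaled': (year - min_year) / (max_year - min_year),
        # percent is already a fraction, so it is passed through unchanged.
        'percent': percent,
        # One-hot encode sex ('girl' is an assumed label).
        'sex_boy': 1 if sex == 'boy' else 0,
        'sex_girl': 1 if sex == 'girl' else 0,
    }

print(context_features(1880, 0.081541, 'boy'))
```

These values could be appended to the end of each name vector so every row stays a flat list of numbers.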


In [0]:
import pandas as pd
import numpy as np

In [2]:
# Engineering parameters.
pad_character       = '~'
allowed_chars       = f'abcdefghijklmnopqrstuvwxyz{pad_character}'
len_allow_chars     = len(allowed_chars)
max_name_length     = 10 

templated_df = pd.DataFrame()

# Create the dataframe.
for i in range(max_name_length):
  for char in allowed_chars:
    templated_df[char + '_' + str(i)] = 0

# Show the first and last ten columns.
templated_df.columns.tolist()[0:10] + templated_df.columns.tolist()[-10:]

['a_0',
 'b_0',
 'c_0',
 'd_0',
 'e_0',
 'f_0',
 'g_0',
 'h_0',
 'i_0',
 'j_0',
 'r_9',
 's_9',
 't_9',
 'u_9',
 'v_9',
 'w_9',
 'x_9',
 'y_9',
 'z_9',
 '~_9']
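
The same column names could also be produced in a single pass with a list comprehension, which avoids adding 270 columns one at a time. This is a sketch of an alternative, not the notebook's method; `column_names` is a name introduced here for illustration.

```python
pad_character = '~'
allowed_chars = f'abcdefghijklmnopqrstuvwxyz{pad_character}'
max_name_length = 10

# Position-major order: every character for position 0, then position 1, and so on.
column_names = [f'{char}_{i}' for i in range(max_name_length) for char in allowed_chars]

# 27 characters x 10 positions.
print(len(column_names))
print(column_names[:3], column_names[-3:])
```

The resulting list could be passed to `pd.DataFrame(columns=column_names)` to create the empty template in one step.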

In [3]:
!git clone https://github.com/hadley/data-baby-names.git

Cloning into 'data-baby-names'...
remote: Enumerating objects: 70, done.
remote: Total 70 (delta 0), reused 0 (delta 0), pack-reused 70
Unpacking objects: 100% (70/70), done.


## Examine the Data

In [0]:
df = pd.read_csv('/content/data-baby-names/baby-names.csv')

In [5]:
df.head()

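|   | year | name    | percent  | sex |
|---|------|---------|----------|-----|
| 0 | 1880 | John    | 0.081541 | boy |
| 1 | 1880 | William | 0.080511 | boy |
| 2 | 1880 | James   | 0.050057 | boy |
| 3 | 1880 | Charles | 0.045167 | boy |
| 4 | 1880 | George  | 0.043292 | boy |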


### Name Popularity

In [6]:
df.sort_values(by = 'percent', ascending = False).head(10)

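|      | year | name    | percent  | sex |
|------|------|---------|----------|-----|
| 0    | 1880 | John    | 0.081541 | boy |
| 1000 | 1881 | John    | 0.080975 | boy |
| 1    | 1880 | William | 0.080511 | boy |
| 3000 | 1883 | John    | 0.079066 | boy |
| 1001 | 1881 | William | 0.078712 | boy |
| 2000 | 1882 | John    | 0.078314 | boy |
| 4000 | 1884 | John    | 0.076476 | boy |
| 2001 | 1882 | William | 0.076191 | boy |
| 6000 | 1886 | John    | 0.07582  | boy |
| 5000 | 1885 | John    | 0.075517 | boy |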


## Preparing the Dataframe

### Vectorizing Names

In [0]:
names = df['name'].str.lower().unique()

In [8]:
num_unique_names = len(names)
print(f'Found total of {num_unique_names} unique names')

Found total of 6782 unique names
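
Because the encoding only covers the first `max_name_length` characters and only the characters in `allowed_chars`, it may be worth checking which names fall outside those limits before vectorizing. The sketch below is one way to do that; `is_encodable` is a hypothetical helper and the sample list is made up for illustration, not taken from the dataset.

```python
# Hypothetical check: can a name be encoded without truncation or dropped characters?
# The padding character is deliberately excluded from the allowed letters here.
def is_encodable(name, max_name_length=10, allowed_chars='abcdefghijklmnopqrstuvwxyz'):
    name = name.lower()
    return len(name) <= max_name_length and all(char in allowed_chars for char in name)

# Made-up sample names, not taken from the dataset.
sample = ['john', 'william', 'christopher', 'mary-ann']
print([name for name in sample if not is_encodable(name)])
```

In the notebook, the same check could be applied to the real data with `[name for name in names if not is_encodable(name)]`.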


In [0]:
# TODO: This code could be made far more performant by refactoring.
#       Right now, it's as slow as a snail on salt.

def vectorize_name(name, max_name_length, allowed_chars, pad_character):
  # Standardize.
  name = name.lower()

  # Pad the name to max_name_length if needed.
  name = name.ljust(max_name_length, pad_character)

  # Collect one flag per (character, position) pair, keyed by feature name.
  features = {}

  # Loop through all positions.
  for feature_index in range(max_name_length):
    # Loop through all allowed characters.
    for allowed_char in allowed_chars:

      # Create a feature for each allowed character by its position (e.g., "j_4").
      feature_name = allowed_char + '_' + str(feature_index)

      # Flag the feature if the name has this character at this position.
      features[feature_name] = int(name[feature_index] == allowed_char)

  # Note: only the first max_name_length characters of a name are encoded.
  # An explicit dtype avoids the empty-Series dtype warning in newer pandas.
  return pd.Series(features, dtype = 'int64')

name_vector = vectorize_name('adam', max_name_length, allowed_chars, pad_character)

In [11]:
name_vector

a_0    1
b_0    0
c_0    0
d_0    0
e_0    0
      ..
w_9    0
x_9    0
y_9    0
z_9    0
~_9    1
Length: 270, dtype: int64

In [13]:
del df

# Vectorize every name and collect the rows in a list. Building the dataframe
# once at the end avoids DataFrame.append, which copied the whole frame on
# every call and was removed in pandas 2.0. The feature names come from the
# vectors themselves, so no dummy 'test' row is needed.
rows = []

for name in names:
  print(name)
  name_vector = vectorize_name(name, max_name_length, allowed_chars, pad_character)

  name_vector['name'] = name
  rows.append(name_vector)

df = pd.DataFrame(rows).reset_index(drop = True)


john
william
james
charles
george
frank
joseph
thomas
henry
robert


KeyboardInterrupt: ignored
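
The loop above was interrupted because it is slow. One possible speed-up, sketched below under the same encoding assumptions (27 allowed characters, length 10, `~` padding), is to compute each active column's offset directly instead of comparing every character at every position. `encode_name` is a hypothetical helper, not the notebook's method.

```python
# Hypothetical faster encoder: set one entry per position by computing its offset.
def encode_name(name, max_name_length=10,
                allowed_chars='abcdefghijklmnopqrstuvwxyz~', pad_character='~'):
    width = len(allowed_chars)
    padded = name.lower().ljust(max_name_length, pad_character)[:max_name_length]
    vector = [0] * (max_name_length * width)
    for position, char in enumerate(padded):
        char_index = allowed_chars.find(char)
        # Characters outside allowed_chars leave their position all-zero.
        if char_index >= 0:
            vector[position * width + char_index] = 1
    return vector

rows = [encode_name(name) for name in ['john', 'william', 'james']]

# Number of rows, features per row, and active features in the first row.
print(len(rows), len(rows[0]), sum(rows[0]))
```

The offsets follow the same position-major column order as `templated_df`, so the rows could be wrapped in one call such as `pd.DataFrame(rows, columns=templated_df.columns)`.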

In [0]:
df.to_csv('vectorized_names.csv')