<a href="https://colab.research.google.com/github/Ladvien/gan_name_maker/blob/master/deep_name_prep_data_sparse.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Deep Name Generator

This project is a proof of concept showing that "organic" first names can be generated using a [Generative Adversarial Network](https://en.wikipedia.org/wiki/Generative_adversarial_network). We use a found dataset provided by [Hadley Wickham](http://hadley.nz/) at RStudio.

The goal is to vectorize each name in the following format:

| char_0 | char_1 | char_2 | ... | char_9 | char_10 | etc |
|-----|-----|-----|-----|-----|------|-----|
|  4  |  3  |  0  | ... |  19  |  0   |  17  |
|  24  |  2  |  1  | ... |  11  |  2   |  3  |

Here each column name gives a character's position in the string, and each value is the integer index of that character in the allowed alphabet (a compact stand-in for its one-hot encoding).

For example, the name `Abby` would be represented by the following vector (`a` is 0, `b` is 1, `y` is 24, and the pad character is 26):

| char_0 | char_1 | char_2 | char_3 | char_4 | etc |
|-----|-----|-----|-----|-----|-----|
|  0  |  1  |  1  |  24  |  26  | ... |
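
As a minimal sketch of that mapping, assuming the 27-character alphabet (`a` to `z` plus a `~` pad) defined later in this notebook, the indices for `Abby` can be computed directly. The helper name `encode_name` is hypothetical and is not part of the notebook.

```python
# Assumed alphabet: lowercase a-z followed by the pad character '~' (index 26).
pad_character = '~'
allowed_chars = 'abcdefghijklmnopqrstuvwxyz' + pad_character

def encode_name(name, length=10):
    """Hypothetical helper: map a name to a fixed-length list of character indices."""
    # Lowercase, pad on the right, and cut to the fixed length.
    padded = name.lower().ljust(length, pad_character)[:length]
    return [allowed_chars.index(char) for char in padded]

abby_vector = encode_name('Abby')
print(abby_vector)  # [0, 1, 1, 24, 26, 26, 26, 26, 26, 26]
```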

Wickham's dataset also includes:

* `year`
* `percent` (popularity)
* `sex`

It may be interesting to add these as additional features so the model can learn the context in which a first name was used.
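
One possible way to turn these columns into numeric features is sketched below. It is only an illustration: the three rows are copied from the popularity table shown later in this notebook, the `girl` label and the 0/1 coding for `sex` are assumptions, and the column names `sex_code` and `year_scaled` are hypothetical.

```python
import pandas as pd

# Three rows copied from the popularity table later in this notebook.
sample = pd.DataFrame({
    'year': [1880, 1881, 1883],
    'name': ['John', 'John', 'John'],
    'percent': [0.081541, 0.080975, 0.079066],
    'sex': ['boy', 'boy', 'boy'],
})

# Assumption: 'sex' takes the values 'boy' and 'girl'; encode them as 0 and 1.
sample['sex_code'] = sample['sex'].map({'boy': 0, 'girl': 1})

# Min-max scale 'year' to the range [0, 1] using the years present in the sample.
year_min, year_max = sample['year'].min(), sample['year'].max()
sample['year_scaled'] = (sample['year'] - year_min) / (year_max - year_min)

features = sample[['year_scaled', 'percent', 'sex_code']]
print(features)
```

These columns could then sit next to the `char_*` columns, although whether they help the model is untested here.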


In [0]:
import pandas as pd
import numpy as np

In [11]:
# Engineering parameters.
pad_character       = '~'
allowed_chars       = f'abcdefghijklmnopqrstuvwxyz{pad_character}'
len_allow_chars     = len(allowed_chars)
max_name_length     = 10 

templated_df = pd.DataFrame()

# Create the dataframe.
for i in range(max_name_length):
    templated_df['char' + '_' + str(i)] = 0

# Show the first and last ten columns.
templated_df.columns.tolist()[0:10] + templated_df.columns.tolist()[-10:]

['char_0',
 'char_1',
 'char_2',
 'char_3',
 'char_4',
 'char_5',
 'char_6',
 'char_7',
 'char_8',
 'char_9',
 'char_0',
 'char_1',
 'char_2',
 'char_3',
 'char_4',
 'char_5',
 'char_6',
 'char_7',
 'char_8',
 'char_9']

In [12]:
!git clone https://github.com/hadley/data-baby-names.git

fatal: destination path 'data-baby-names' already exists and is not an empty directory.


## Examine the Data

In [0]:
df = pd.read_csv('/content/data-baby-names/baby-names.csv')

In [14]:
df.head()

Unnamed: 0,year,name,percent,sex
0,1880,John,0.081541,boy
1,1880,William,0.080511,boy
2,1880,James,0.050057,boy
3,1880,Charles,0.045167,boy
4,1880,George,0.043292,boy


### Name Popularity

In [15]:
df.sort_values(by = 'percent', ascending = False).head(10)

Unnamed: 0,year,name,percent,sex
0,1880,John,0.081541,boy
1000,1881,John,0.080975,boy
1,1880,William,0.080511,boy
3000,1883,John,0.079066,boy
1001,1881,William,0.078712,boy
2000,1882,John,0.078314,boy
4000,1884,John,0.076476,boy
2001,1882,William,0.076191,boy
6000,1886,John,0.07582,boy
5000,1885,John,0.075517,boy


# Preparing Dataframe

## Vectorizing Names

In [0]:
names = df['name'].str.lower().unique()

In [17]:
num_unique_names = len(names)
print(f'Found total of {num_unique_names} unique names')

Found total of 6782 unique names
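
Because `max_name_length` is 10, any longer name keeps only its first ten characters when it is vectorized below. A quick check along these lines can show which names are affected; `sample_names` is a hypothetical stand-in for the `names` array, using three names that appear in this notebook's output.

```python
import pandas as pd

max_name_length = 10

# Hypothetical stand-in for `names`; in the notebook, pd.Series(names) could be used instead.
sample_names = pd.Series(['john', 'christopher', 'alexander'])

# Keep only the names that exceed the fixed vector width.
too_long = sample_names[sample_names.str.len() > max_name_length]
print(too_long.tolist())  # ['christopher']
```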


In [0]:
# TODO: This code could be made far more performant by refactoring.
#       Right now, it's as slow as a snail on salt.

def vectorize_name(name, max_name_length, allowed_chars, pad_character):

  # Standardize.
  name = name.lower()

  # Pad short names; names longer than max_name_length are truncated.
  name = name.ljust(max_name_length, pad_character)[:max_name_length]

  # Map each position to the index of its character in allowed_chars.
  # Characters outside allowed_chars raise a ValueError.
  name_vector = pd.Series(
    {'char_' + str(i): allowed_chars.index(char) for i, char in enumerate(name)}
  )

  return name_vector

name_vector = vectorize_name('adam', max_name_length, allowed_chars, pad_character)

In [20]:
name_vector

char_0     0
char_1     3
char_2     0
char_3    12
char_4    26
char_5    26
char_6    26
char_7    26
char_8    26
char_9    26
dtype: int64
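
The TODO above notes that per-name vectorization is slow. One possible refactor, sketched here under the assumption that a plain NumPy array is an acceptable output, encodes a whole batch at once using a dictionary lookup instead of repeated `str.index` calls. The function name `vectorize_names` is hypothetical.

```python
import numpy as np

pad_character = '~'
allowed_chars = 'abcdefghijklmnopqrstuvwxyz' + pad_character
max_name_length = 10

def vectorize_names(names, max_name_length, allowed_chars, pad_character):
    """Hypothetical batch variant: return an (n_names, max_name_length) int array."""
    char_to_index = {char: i for i, char in enumerate(allowed_chars)}
    pad_index = char_to_index[pad_character]
    # Start with every position set to the pad index, then overwrite real characters.
    encoded = np.full((len(names), max_name_length), pad_index, dtype=np.int64)
    for row, name in enumerate(names):
        # Names longer than max_name_length are truncated, as in vectorize_name.
        for col, char in enumerate(name.lower()[:max_name_length]):
            encoded[row, col] = char_to_index[char]  # KeyError if char is not allowed
    return encoded

batch = vectorize_names(['adam', 'john'], max_name_length, allowed_chars, pad_character)
print(batch[0].tolist())  # [0, 3, 0, 12, 26, 26, 26, 26, 26, 26]
```

The resulting array could be wrapped in a dataframe in one step, which avoids growing a dataframe row by row.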

In [21]:
del df
# Seed the rows with a 'test' feature vector so the char_* feature names come first.
# Note: this leaves a placeholder row 0 with no name in the final dataframe.
rows = [vectorize_name('test', max_name_length, allowed_chars, pad_character)]

for name in names:
  print(name)
  name_vector = vectorize_name(name, max_name_length, allowed_chars, pad_character)

  name_vector['name'] = name
  rows.append(name_vector)

# DataFrame.append was removed in pandas 2.0, so build the dataframe once from all rows.
df = pd.DataFrame(rows)


john
william
james
charles
george
frank
joseph
thomas
henry
robert
edward
harry
walter
arthur
fred
albert
samuel
david
louis
joe
charlie
clarence
richard
andrew
daniel
ernest
will
jesse
oscar
lewis
peter
benjamin
frederick
willie
alfred
sam
roy
herbert
jacob
tom
elmer
carl
lee
howard
martin
michael
bert
herman
jim
francis
harvey
earl
eugene
ralph
ed
claude
edwin
ben
charley
paul
edgar
isaac
otto
luther
lawrence
ira
patrick
guy
oliver
theodore
hugh
clyde
alexander
august
floyd
homer
jack
leonard
horace
marion
philip
allen
archie
stephen
chester
willis
raymond
rufus
warren
jessie
milton
alex
leo
julius
ray
sidney
bernard
dan
jerry
calvin
perry
dave
anthony
eddie
amos
dennis
clifford
leroy
wesley
alonzo
garfield
franklin
emil
leon
nathan
harold
matthew
levi
moses
everett
lester
winfield
adam
lloyd
mack
fredrick
jay
jess
melvin
noah
aaron
alvin
norman
gilbert
elijah
victor
gus
nelson
jasper
silas
christopher
jake
mike
percy
adolph
maurice
cornelius
felix
reuben
wallace
claud
roscoe
sylvest

In [0]:
df.to_csv('vectorized_names_sparse.csv')

In [23]:
df.head()

Unnamed: 0,char_0,char_1,char_2,char_3,char_4,char_5,char_6,char_7,char_8,char_9,name
0,19,4,18,19,26,26,26,26,26,26,
1,9,14,7,13,26,26,26,26,26,26,john
2,22,8,11,11,8,0,12,26,26,26,william
3,9,0,12,4,18,26,26,26,26,26,james
4,2,7,0,17,11,4,18,26,26,26,charles
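
As a sanity check on the encoding, a row can be mapped back to text by reversing the index lookup and stripping the padding. This is a minimal sketch; `decode_vector` is a hypothetical helper, and the input is the `william` row from the table above.

```python
pad_character = '~'
allowed_chars = 'abcdefghijklmnopqrstuvwxyz' + pad_character

def decode_vector(indices):
    """Hypothetical helper: turn a sequence of character indices back into a name."""
    chars = ''.join(allowed_chars[i] for i in indices)
    # Drop the trailing pad characters added during vectorization.
    return chars.rstrip(pad_character)

decoded = decode_vector([22, 8, 11, 11, 8, 0, 12, 26, 26, 26])
print(decoded)  # william
```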
