<a href="https://colab.research.google.com/github/Ladvien/gan_name_maker/blob/master/deep_name_prep_data_sparse.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Deep Name Generator

This project is a proof of concept showing that "organic" first names can be generated with a [Generative Adversarial Network](https://en.wikipedia.org/wiki/Generative_adversarial_network). It uses a found dataset provided by [Hadley Wickham](http://hadley.nz/) at RStudio.

The goal will be to vectorize each of the names in the following format:

| char_0 | char_1 | char_2 | ... | char_7 | char_8 | char_9 |
|-----|-----|-----|-----|-----|------|-----|
|  4  |  3  |  0  | ... |  19  |  0   |  17  |
|  24  |  2  |  1  | ... |  11  |  2   |  3  |

Each cell holds the integer code of one character, and the number in the column name is that character's position in the string.

For example, the name `Abby` would be represented with the following vector.

| char_0 | char_1 | char_2 | char_3 | char_4 | etc |
|-----|-----|-----|-----|-----|-----|
|  0  |  1  |  1  |  24  |  26  | ... |
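
The mapping from a name to its vector can be sketched as below. This is a minimal illustration, assuming a single alphabet shared by every character position and the same pad character and maximum length used in this notebook's engineering parameters; `char_to_index` and `encode_name` are hypothetical helpers, not part of the notebook.

```python
# Assumed to mirror the engineering parameters used in this notebook.
pad_character = '~'
allowed_chars = f'abcdefghijklmnopqrstuvwxyz{pad_character}'
max_name_length = 10

# Hypothetical lookup table: character -> integer code (a=0 ... z=25, pad=26).
char_to_index = {char: index for index, char in enumerate(allowed_chars)}


def encode_name(name):
    """Lower-case a name, right-pad it, and return one integer code per character."""
    if len(name) > max_name_length:
        raise ValueError(f'{name!r} is longer than {max_name_length} characters')
    padded = name.lower().ljust(max_name_length, pad_character)
    return [char_to_index[char] for char in padded]


print(encode_name('Abby'))
```

Every position past the end of the name receives the pad character's code, so all vectors have the same length regardless of the name.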

Wickham's dataset also includes the following columns:

* `year`
* `percent` (popularity)
* `sex`

It may be interesting to add these as additional features so the model can learn the context in which a first name is used.
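
One possible way to turn these columns into numeric features is sketched here. It is only an illustration: the three rows are copied from the dataset previews shown later in this notebook, the min-max scaling of `year` is an arbitrary choice, and the `'girl'` label for `sex` is an assumption because only `'boy'` rows appear in the previews. The names `sample` and `features` are hypothetical.

```python
import pandas as pd

# Three rows copied from the dataset previews in this notebook.
sample = pd.DataFrame({
    'year': [1880, 1880, 1881],
    'name': ['John', 'William', 'John'],
    'percent': [0.081541, 0.080511, 0.080975],
    'sex': ['boy', 'boy', 'boy'],
})

year_min = sample['year'].min()
year_max = sample['year'].max()

features = pd.DataFrame({
    # Scale the year into the range [0, 1].
    'year_scaled': (sample['year'] - year_min) / (year_max - year_min),
    # The popularity share is already a fraction, so it is kept as-is.
    'percent': sample['percent'],
    # Assumption: the only other value in the 'sex' column is 'girl'.
    'is_girl': (sample['sex'] == 'girl').astype(int),
})

print(features)
```

These columns could then be concatenated with the character columns so each training row carries both the name and its context.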


In [0]:
import pandas as pd
import numpy as np

In [0]:
# Engineering parameters.
pad_character       = '~'
allowed_chars       = f'abcdefghijklmnopqrstuvwxyz{pad_character}'
len_allow_chars     = len(allowed_chars)
max_name_length     = 10 

templated_df = pd.DataFrame()

# Create the dataframe.
for i in range(max_name_length):
    templated_df['char' + '_' + str(i)] = 0

# Show the first and last ten column names (with only ten columns, both slices are identical).
templated_df.columns.tolist()[0:10] + templated_df.columns.tolist()[-10:]

['char_0',
 'char_1',
 'char_2',
 'char_3',
 'char_4',
 'char_5',
 'char_6',
 'char_7',
 'char_8',
 'char_9',
 'char_0',
 'char_1',
 'char_2',
 'char_3',
 'char_4',
 'char_5',
 'char_6',
 'char_7',
 'char_8',
 'char_9']

In [0]:
!git clone https://github.com/hadley/data-baby-names.git

fatal: destination path 'data-baby-names' already exists and is not an empty directory.


## Examine the Data

In [0]:
df = pd.read_csv('/content/data-baby-names/baby-names.csv')

In [0]:
df.head()

Unnamed: 0,year,name,percent,sex
0,1880,John,0.081541,boy
1,1880,William,0.080511,boy
2,1880,James,0.050057,boy
3,1880,Charles,0.045167,boy
4,1880,George,0.043292,boy


### Name Popularity

In [0]:
df.sort_values(by = 'percent', ascending = False).head(10)

Unnamed: 0,year,name,percent,sex
0,1880,John,0.081541,boy
1000,1881,John,0.080975,boy
1,1880,William,0.080511,boy
3000,1883,John,0.079066,boy
1001,1881,William,0.078712,boy
2000,1882,John,0.078314,boy
4000,1884,John,0.076476,boy
2001,1882,William,0.076191,boy
6000,1886,John,0.07582,boy
5000,1885,John,0.075517,boy


# Preparing Dataframe

## Vectorizing Names

In [0]:
names = df['name'].str.lower().unique()

In [0]:
num_unique_names = len(names)
print(f'Found total of {num_unique_names} unique names')

Found total of 6782 unique names


In [0]:
from sklearn.preprocessing import LabelEncoder

# Fire up an encoder
le = LabelEncoder()

# Standardize
df['name'] = df['name'].str.lower()

# Exclude all names over the maximum name length.
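# Copy the filtered frame so the assignment below does not modify a view of the original.
df = df[df['name'].str.len() <= max_name_length].copy()

# Right-pad shorter names with the pad character.
df['name'] = df['name'].str.ljust(max_name_length, pad_character)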

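# Store the padded names so they can be re-attached after encoding.
# Note: the slice skips the first len(allowed_chars) rows (27), so the
# re-attached 'name' column is offset from the encoded rows by that amount.
tmp = df['name'].iloc[len(allowed_chars):].reset_index(drop = True)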

# Chop the name into columns by character.
df = df['name'].apply(lambda x: pd.Series(list(x)))



In [0]:
df.shape

(257730, 10)

In [0]:
# Encode the characters
df = df.apply(le.fit_transform)
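
Because `df.apply(le.fit_transform)` refits the encoder on each column separately, the same character can receive a different code in different columns (the pad character's code changes across the trailing columns in the output further down). If one consistent code per character is wanted, a shared lookup table is one alternative. The sketch below is an assumption about how that could look, not the method used in this notebook; `char_to_index`, `chars`, `encoded`, and `decoded` are hypothetical names, and the three sample names are illustrative.

```python
import pandas as pd

pad_character = '~'
allowed_chars = f'abcdefghijklmnopqrstuvwxyz{pad_character}'
max_name_length = 10

# One lookup table shared by every character position.
char_to_index = {char: index for index, char in enumerate(allowed_chars)}

names = pd.Series(['john', 'william', 'abby'])
padded = names.str.ljust(max_name_length, pad_character)

# Split each padded name into one column per character position.
chars = pd.DataFrame(
    padded.map(list).tolist(),
    columns=[f'char_{i}' for i in range(max_name_length)],
)

# Apply the same mapping to every column.
encoded = chars.apply(lambda column: column.map(char_to_index))

# Decode each row back to a string to confirm the mapping is reversible.
decoded = encoded.apply(
    lambda row: ''.join(allowed_chars[code] for code in row), axis=1
)

print(encoded)
print(decoded.str.rstrip(pad_character).tolist())
```

With a shared table, a generated vector can be decoded without knowing which column each code came from.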

In [0]:
# Give the columns names
df.columns = ['char_' + str(x) for x in range(df.shape[1])]

In [0]:
df = df.reset_index(drop = True)
df['name'] = tmp

In [0]:
df.tail(100)

Unnamed: 0,char_0,char_1,char_2,char_3,char_4,char_5,char_6,char_7,char_8,char_9,name
257900,2,0,17,8,13,0,25,23,14,10,
257901,10,0,17,11,8,26,25,23,14,10,
257902,12,0,6,3,0,11,4,12,0,10,
257903,18,19,4,15,7,0,12,21,14,10,
257904,2,7,0,17,11,8,24,4,14,10,
...,...,...,...,...,...,...,...,...,...,...,...
257995,2,0,17,11,4,8,6,7,14,10,
257996,8,24,0,13,0,26,25,23,14,10,
257997,10,4,13,11,4,24,25,23,14,10,
257998,18,11,14,0,13,4,25,23,14,10,


In [0]:
df.to_csv('vectorized_names_sparse.csv')

In [0]:
df.head()

Unnamed: 0,char_0,char_1,char_2,char_3,char_4,char_5,char_6,char_7,char_8,char_9,name
0,9,14,7,13,26,26,25,23,14,10,jesse~~~~~
1,22,8,11,11,8,0,11,23,14,10,oscar~~~~~
2,9,0,12,4,18,26,25,23,14,10,lewis~~~~~
3,2,7,0,17,11,4,17,23,14,10,peter~~~~~
4,6,4,14,17,6,4,25,23,14,10,benjamin~~
