# Baby name

![](https://images.unsplash.com/photo-1519689680058-324335c77eba?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1050&q=80)

In this challenge, you will generate baby names using recurrent neural networks!

The dataset is in the file `names.txt`, encoded in `'ISO-8859-1'`, and contains more than 10 000 names.
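As a small illustration of why the encoding has to be given explicitly, the sketch below encodes an accented name in ISO-8859-1 and shows that the resulting bytes cannot be read back as UTF-8. The name `"zoé"` is only an illustrative value, not taken from the dataset.

```python
# 'é' is a single byte (0xE9) in ISO-8859-1.
raw = "zoé".encode("iso-8859-1")

# Decoding with the matching encoding gives the original text back.
decoded = raw.decode("iso-8859-1")

# The same bytes are not valid UTF-8, so decoding them as UTF-8 fails.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(decoded, utf8_ok)
```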

First load it, have a look at the names, and clean the dataset if needed.

In [256]:
import pandas as pd
import tensorflow as tf

In [257]:
names = pd.read_csv('../input/names.txt', encoding='ISO-8859-1')

In [258]:
names

Unnamed: 0,name
0,aaliyah
1,aapeli
2,aapo
3,aaren
4,aarne
...,...
11611,zvi
11612,zvonimir
11613,zvonimira
11614,zvonko


The RNN needs to know where a word begins and where it ends. So we add a special character at the beginning of every word, for example `'\t'` (it could be anything else, as long as it can be identified easily). We also add `'\n'` at the end of every word to mark its end.

In [259]:
# TODO: add '\t' at the beginning and '\n' at the end of every word
names = names.name.apply(lambda x: "\t" + x + "\n")

To generate names, we will work at the character level: we will train an RNN to predict the next character, knowing the previous ones. So, compute a list of all the possible characters.

In [260]:
# TODO: Compute and display the list of all possible characters
# sorted() gives a stable order, so the character indices are the same on every run
characters = sorted(set(c for word in names.values for c in word))

In [261]:
len(characters)

55

You should get 55 characters, right?

As usual when playing with characters (or words), we will convert them into integers. So build a dictionary `char_to_idx` that, given a character as key, returns an integer. And build the opposite dictionary `idx_to_char` that, given an integer as key, returns the corresponding character.

In [262]:
# TODO: Compute the idx_to_char and char_to_idx dict
char_to_idx = {char: i for i, char in enumerate(characters)}

In [263]:
idx_to_char = {v: k for k, v in char_to_idx.items()}
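As a quick sanity check of the two dictionaries, the sketch below builds them on a toy vocabulary and verifies that encoding then decoding a name gives the original string back. The `toy_*` names and the two sample names are illustrative assumptions, not part of the dataset.

```python
# Two toy names, already wrapped with the start ('\t') and end ('\n') markers.
toy_names = ["\tana\n", "\tbob\n"]

# Sorted vocabulary: '\t', '\n', 'a', 'b', 'n', 'o'
toy_chars = sorted(set(c for name in toy_names for c in name))
toy_char_to_idx = {c: i for i, c in enumerate(toy_chars)}
toy_idx_to_char = {i: c for c, i in toy_char_to_idx.items()}

# Encode the first name as integers, then decode it back.
encoded = [toy_char_to_idx[c] for c in toy_names[0]]
decoded = "".join(toy_idx_to_char[i] for i in encoded)

print(encoded)
print(repr(decoded))
```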

Before going into the neural network part, we have one more step: **create the X and y data**!

So the **X** data is going to be, for every name, all but the `'\n'` character. The **y** data will be all but the `'\t'` character.

Indeed, we will try to predict the following character knowing the previous ones. So the **X** does not need the final character, and the **y** does not need the first character.

Add the columns X and y to the dataframe.

In [264]:
# TODO: Create the columns X and y
X = names.apply(lambda x: x[:-1])
y = names.apply(lambda x: x[1:])
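To make the one-character shift concrete, here is a minimal sketch on a single name: each character of `X` is paired with the character the network should predict at that step. The variable names `x_chars`, `y_chars` and `pairs` are illustrative.

```python
name = "\tzvi\n"

x_chars = name[:-1]  # everything except the final '\n'
y_chars = name[1:]   # everything except the initial '\t'

# At each time step, the input character is paired with its target.
pairs = list(zip(x_chars, y_chars))
for inp, target in pairs:
    print(repr(inp), "->", repr(target))
```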

Now, using your `char_to_idx` dict, compute the corresponding `X` and `y` containing, for each name, a list of integers.

In [265]:
# TODO: Create the X and y variables containing integers only

In [266]:
X = X.apply(lambda x: [char_to_idx[c] for c in x])

In [267]:
y = y.apply(lambda x: [char_to_idx[c] for c in x])

That was the hard part. We are now in a familiar situation: use the Keras `pad_sequences()` function to get proper `X` and `y` arrays with `maxlen=16`.

In [268]:
# TODO: Use pad_sequences to get only sequences of length 16 for each name
from tensorflow.keras.preprocessing import sequence

X = sequence.pad_sequences(X,
                           value=char_to_idx['\n'], # pad with the end-of-word index, not 0
                           padding='post', # to add the padding at the end
                           truncating='post', # to cut the end of long sequences
                           maxlen=16) # the length we want

y = sequence.pad_sequences(y,
                           value=char_to_idx['\n'], # pad with the end-of-word index, not 0
                           padding='post', # to add the padding at the end
                           truncating='post', # to cut the end of long sequences
                           maxlen=16) # the length we want
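To see what post-padding and post-truncating do, here is a pure-Python sketch. The helper `pad_post` is a hypothetical stand-in written for illustration, not the Keras implementation. In the notebook the padding value is the index of `'\n'`; the toy example uses `0` for readability.

```python
def pad_post(seqs, maxlen, value):
    """Hypothetical stand-in: cut or pad every sequence at the end to length maxlen."""
    padded = []
    for seq in seqs:
        seq = list(seq)[:maxlen]                            # truncating='post'
        padded.append(seq + [value] * (maxlen - len(seq)))  # padding='post'
    return padded

toy = [[3, 1, 4], [2, 7, 1, 8, 2, 8]]
result = pad_post(toy, maxlen=5, value=0)
print(result)
```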

Finally, using the function `to_categorical()`, perform the required one-hot encoding.

In [269]:
# TODO: use to_categorical to perform one hot encoding
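# num_classes guarantees a last dimension of 55 even if the highest index never appears
X = tf.keras.utils.to_categorical(X, num_classes=len(char_to_idx))
y = tf.keras.utils.to_categorical(y, num_classes=len(char_to_idx))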

You should finally have arrays of shape `(number of names, 16, 55)`:
- `16` is the sequence length
- `55` is the number of possible characters
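The way the third dimension appears can be sketched with plain NumPy: indexing an identity matrix with an integer array turns every index into a one-hot row. This is only an illustration of the resulting shape on toy values, not how `to_categorical()` is implemented.

```python
import numpy as np

num_classes = 5                            # toy vocabulary size
indices = np.array([[0, 2, 4],
                    [1, 1, 3]])            # shape (2 names, 3 time steps)

# Each integer selects the matching row of the identity matrix.
one_hot = np.eye(num_classes, dtype="float32")[indices]

print(one_hot.shape)
print(one_hot[0, 1])
```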

In [270]:
X.shape

(11616, 16, 55)

In [271]:
y.shape

(11616, 16, 55)

Now you have to build a neural network. You can for example use one or two layers of GRU (or LSTM). Do not forget to set `return_sequences=True`. 

Then you will have to add a `TimeDistributed(Dense(55))` layer with a softmax activation function. This layer applies the same dense layer at each time step and outputs a softmax prediction of the next character.
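What a time-distributed dense layer with softmax computes can be sketched in NumPy: the same weight matrix is applied to the hidden state of every time step, and each step gets its own probability distribution over the 55 characters. The random hidden states `h`, weights `W` and bias `b` are placeholders for values a trained network would produce.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, steps, hidden, n_chars = 2, 16, 32, 55

h = rng.normal(size=(batch, steps, hidden))   # stand-in for the GRU outputs
W = rng.normal(size=(hidden, n_chars)) * 0.1  # one weight matrix shared by all steps
b = np.zeros(n_chars)

# The matrix product broadcasts over the batch and time dimensions.
logits = h @ W + b

# Numerically stable softmax over the character dimension.
logits = logits - logits.max(axis=-1, keepdims=True)
probs = np.exp(logits)
probs = probs / probs.sum(axis=-1, keepdims=True)

print(probs.shape)
```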

In [272]:
from sklearn.model_selection import train_test_split

In [273]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import GRU, LSTM, Dense, TimeDistributed
from tensorflow.keras.callbacks import EarlyStopping

In [274]:
# TODO: Build the neural network
def name_RNN():
    model = Sequential()
    
    model.add(GRU(units=32, activation='relu', return_sequences=True))
    model.add(GRU(units=32, activation='relu', return_sequences=True))
  
    model.add(TimeDistributed(Dense(55, activation='softmax')))
    
    return model

Finally, train your model!

In [275]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

In [276]:
model = name_RNN()

In [277]:
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

In [278]:
callbacks = [EarlyStopping(patience=10, restore_best_weights=True)]

In [279]:
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=64, callbacks=callbacks)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7ff263507910>

The final step will be to generate names, through a function `generate_names()`. 

To do so, you will have to give the output of the previous time step prediction as input to the next time step.

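You will have to use the method `predict_proba` of your model (recent TensorFlow versions have removed it; for a softmax output, `predict` returns the same probabilities), as well as the method `numpy.random.choice`.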

Finally, use your function to generate some names!
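The sampling loop can be sketched without a trained network. In the example below, `next_char_probs` is a hypothetical stand-in for the model: it returns a fixed distribution over a toy vocabulary and forces the end marker after three letters. Only the structure of the loop (predict, sample, append, stop on `'\n'`) is meant to carry over.

```python
import numpy as np

chars = ["\t", "\n", "a", "n"]
toy_char_to_idx = {c: i for i, c in enumerate(chars)}

def next_char_probs(prefix):
    """Hypothetical stand-in for the model: distribution over the next character."""
    probs = np.zeros(len(chars))
    if len(prefix) > 3:  # start marker + 3 letters: force the end marker
        probs[toy_char_to_idx["\n"]] = 1.0
    else:
        probs[toy_char_to_idx["a"]] = 0.5
        probs[toy_char_to_idx["n"]] = 0.4
        probs[toy_char_to_idx["\n"]] = 0.1
    return probs

def sample_name(max_len=10):
    name = "\t"  # every name starts with the start marker
    while len(name) < max_len:
        probs = next_char_probs(name)
        idx = np.random.choice(len(chars), p=probs)  # sample, do not take the argmax
        c = chars[idx]
        if c == "\n":  # end marker: the name is complete
            break
        name += c      # feed the sampled character back as input
    return name[1:]    # drop the start marker

samples = [sample_name() for _ in range(5)]
print(samples)
```

Sampling from the distribution, rather than always taking the most likely character, is what makes each generated name different.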

In [280]:
import numpy as np

In [281]:
model.predict(X_test)[0][2]

array([1.87666453e-02, 4.04689526e-05, 5.01026152e-05, 2.20640097e-03,
       3.23441607e-04, 3.12255532e-03, 2.71252333e-03, 5.36873983e-07,
       1.95922068e-04, 3.64679909e-05, 3.15676220e-06, 1.67048629e-03,
       2.55207706e-05, 2.24082596e-05, 2.76832134e-06, 1.23888913e-05,
       4.21849824e-03, 5.65304690e-05, 8.92294012e-03, 2.55219522e-04,
       4.49251290e-03, 1.28716332e-04, 9.89208370e-03, 5.37380949e-03,
       1.00988123e-04, 1.91884556e-05, 1.63284466e-02, 2.73219612e-05,
       4.64622444e-03, 3.52850257e-05, 3.57403391e-04, 7.31596909e-03,
       2.22575013e-02, 3.14246655e-01, 5.49622206e-03, 4.65169287e-05,
       8.52178559e-02, 1.84275322e-02, 7.46846106e-03, 1.59478765e-02,
       1.73325818e-02, 2.76729371e-03, 8.05831049e-03, 1.93544406e-06,
       3.34900199e-03, 4.73601716e-07, 1.26681434e-05, 1.20163031e-01,
       2.75831044e-01, 2.28373581e-04, 6.65775547e-03, 1.02337486e-04,
       4.77080030e-05, 2.94019608e-03, 2.03564437e-03], dtype=float32)

In [286]:
word = []
for l in model.predict(X_test)[5]:
    idx = np.argmax(l)
    word.append(idx_to_char[idx])

In [287]:
word_test = []
for l in X_test[5]:
    idx = np.argmax(l)
    word_test.append(idx_to_char[idx])

In [288]:
word_test

['\t',
 'h',
 'y',
 'l',
 'e',
 'd',
 'd',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n']

In [289]:
word

['m',
 'a',
 'r',
 'e',
 'n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n']

In [290]:
def generate_n_names(n, max_len, char_to_idx, model):
    """
    Generate and print n names automatically

    Returns: None (the names are printed)

    parameters:
    -- n: the number of names to generate (int)
    -- max_len: the length of the sequence
    -- char_to_idx: the dict giving the idx corresponding to each char
    -- model: the trained model that will be used to generate names
    """
    # characters ordered by index, so that chars[i] matches probs[i]
    chars = sorted(char_to_idx, key=char_to_idx.get)
    for _ in range(n):
        stop = False
        ch = '\t'
        counter = 1
        target_seq = np.zeros((1, max_len, len(char_to_idx)))
        target_seq[0, 0, char_to_idx[ch]] = 1.
        # stop at max_len to avoid writing past the end of target_seq
        while not stop and counter < max_len:
            # sample the next character from the predicted distribution
            # (predict is used because predict_proba was removed from recent TensorFlow versions)
            probs = model.predict(target_seq, verbose=0)[0, counter - 1, :]
            probs = probs.astype('float64')
            probs /= probs.sum()  # float32 rounding can break the sum-to-1 check
            c = np.random.choice(chars, p=probs)
            if c == '\n':
                stop = True
            else:
                ch = ch + c
                target_seq[0, counter, char_to_idx[c]] = 1.
                counter = counter + 1
        print(ch[1:])  # drop the start marker

In [291]:
generate_n_names(10, 8, char_to_idx, model)

In case this looks too complicated (indeed it is far from being simple), you can use the function `generate_n_names()` in the file `generate.py`. But first have a look at it and try to understand what it does!

If you have more time, you can try to improve the results by tuning your neural network hyperparameters.

You can also use the original file, `Prenoms.csv`, and keep only the names from a given origin to build a more specific model.

**Conclusion**: This method can be applied to almost any sequence: you can generate music, Shakespeare-style text, lyrics... All it takes is to change the data preprocessing and adapt the dimensions.