# Baby name

![](https://images.unsplash.com/photo-1519689680058-324335c77eba?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1050&q=80)

In this challenge, you will generate baby names using recurrent neural networks!

The dataset is in the file `names.txt`, encoded in `'ISO-8859-1'`, and contains more than 10 000 names.

First load it, have a look at the names, and clean the dataset if needed.

In [42]:
# TODO: Load the dataset and explore it
import pandas as pd
import numpy as np

with open('../input/names.txt', 'r', encoding="ISO-8859-1") as fh:
    names = [line for line in fh]
print(names[:10])

['name\n', 'aaliyah\n', 'aapeli\n', 'aapo\n', 'aaren\n', 'aarne\n', 'aarón\n', 'aaron\n', 'aatami\n', 'aatto\n']
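The first printed entry, `'name\n'`, is the header row of the file rather than a name. Below is a minimal cleaning sketch, run on a small in-memory sample instead of `names.txt`. The helper name `clean_names` is hypothetical and not part of the challenge files, and the choice to lowercase and deduplicate is an assumption about what "cleaning" should mean here.

```python
def clean_names(lines):
    """Drop the header row, blank lines, and duplicates; keep '\n' as end marker.

    Hypothetical helper for illustration only.
    """
    cleaned = []
    seen = set()
    for line in lines[1:]:             # skip the 'name' header row
        name = line.strip().lower()    # remove surrounding whitespace, normalise case
        if name and name not in seen:  # ignore blank lines and repeated names
            seen.add(name)
            cleaned.append(name + '\n')  # restore the end-of-name marker
    return cleaned


sample = ['name\n', 'aaliyah\n', 'aapeli\n', '\n', 'Aapo\n', 'aapo\n']
print(clean_names(sample))
```

The notebook itself keeps the header in `names` and skips it later with `names[1:]`, which works as well.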


The RNN needs to know where a word begins and where it ends. So we add a special character at the beginning of every word, for example `'\t'` (it could be anything else, as long as it can be identified easily). We can also use `'\n'` at the end of every word as the end marker (the lines read from the file already end with it).

In [43]:
# TODO: add '\t' at the beginning of every word
names = ['\t' + name for name in names]
print(names[:10])

['\tname\n', '\taaliyah\n', '\taapeli\n', '\taapo\n', '\taaren\n', '\taarne\n', '\taarón\n', '\taaron\n', '\taatami\n', '\taatto\n']


To generate names, we will work at the character level: we will train an RNN to predict the next character, knowing the previous ones. So, compute a list of all the possible characters.

In [44]:
# TODO: Compute and display the list of all possible characters
## more than the alphabet including accents
##list_unique_chars = [[list(set([n for n in name])) for name in names]]
list_unique_chars = []
for name in names:
    for n in name:
        if n not in list_unique_chars:
            list_unique_chars.append(n)

list_unique_chars.sort()
print(list_unique_chars)
len(list_unique_chars)

['\t', '\n', '-', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'à', 'á', 'ã', 'ä', 'å', 'æ', 'ç', 'è', 'é', 'ê', 'ë', 'ì', 'í', 'ï', 'ð', 'ñ', 'ò', 'ó', 'ô', 'õ', 'ö', 'ø', 'ù', 'ú', 'ü', 'þ']


55

You should get 55 characters, right?
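The same list of characters can be obtained more compactly with a set. A minimal sketch on a tiny illustrative sample (three names, not the full dataset):

```python
# Tiny illustrative sample, already carrying the '\t' and '\n' markers
names = ['\taaliyah\n', '\taarón\n', '\tabd-al-aziz\n']

# A set keeps one copy of each character; sorting makes the order reproducible
unique_chars = sorted(set(''.join(names)))

print(unique_chars)
print(len(unique_chars))
```

Set membership tests are constant time on average, so this avoids the repeated list scans of the nested loop while giving the same sorted result.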

As usual when playing with characters (or words), we will convert them into integers. So build a dictionary `char_to_idx` that, given a character as key, returns an integer. And build the opposite dictionary `idx_to_char` that, given an integer as key, returns the corresponding character.

In [45]:
# TODO: Compute the idx_to_char and char_to_idx dict
char_to_idx = {}
idx_to_char = {}
for idx, char in enumerate(list_unique_chars):
    char_to_idx[char] = idx
    idx_to_char[idx]  = char

char_to_idx

{'\t': 0,
 '\n': 1,
 '-': 2,
 'a': 3,
 'b': 4,
 'c': 5,
 'd': 6,
 'e': 7,
 'f': 8,
 'g': 9,
 'h': 10,
 'i': 11,
 'j': 12,
 'k': 13,
 'l': 14,
 'm': 15,
 'n': 16,
 'o': 17,
 'p': 18,
 'q': 19,
 'r': 20,
 's': 21,
 't': 22,
 'u': 23,
 'v': 24,
 'w': 25,
 'x': 26,
 'y': 27,
 'z': 28,
 'à': 29,
 'á': 30,
 'ã': 31,
 'ä': 32,
 'å': 33,
 'æ': 34,
 'ç': 35,
 'è': 36,
 'é': 37,
 'ê': 38,
 'ë': 39,
 'ì': 40,
 'í': 41,
 'ï': 42,
 'ð': 43,
 'ñ': 44,
 'ò': 45,
 'ó': 46,
 'ô': 47,
 'õ': 48,
 'ö': 49,
 'ø': 50,
 'ù': 51,
 'ú': 52,
 'ü': 53,
 'þ': 54}
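A quick way to check the two dictionaries is an encode/decode round trip. The sketch below uses a toy six-character vocabulary; the helper names `encode` and `decode` are illustrative and do not come from the challenge files.

```python
chars = ['\t', '\n', 'a', 'b', 'e', 'l']  # toy vocabulary, already sorted

# Dict comprehensions build both mappings in one line each
char_to_idx = {char: idx for idx, char in enumerate(chars)}
idx_to_char = {idx: char for char, idx in char_to_idx.items()}


def encode(text):
    """Map every character of a string to its integer index."""
    return [char_to_idx[c] for c in text]


def decode(indices):
    """Map a list of integer indices back to a string."""
    return ''.join(idx_to_char[i] for i in indices)


encoded = encode('\tabel\n')
print(encoded)
print(repr(decode(encoded)))
```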

Before going into the neural network part, we have one more step: **create the X and y data**!

So the **X** data is going to be, for every name, all but the `'\n'` character. The **y** data will be all but the `'\t'` character.

Indeed, we will try to predict the next character knowing the previous ones. So the **X** does not need the final character, and the **y** does not need the first character.
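For a single name, this one-character shift can be sketched as follows (the name `'abel'` is just an example):

```python
name = '\tabel\n'

x = name[:-1]  # everything but the final '\n'
y = name[1:]   # everything but the leading '\t'

# At each position, the target is the character that follows the input
pairs = list(zip(x, y))
for inp, tgt in pairs:
    print(repr(inp), '->', repr(tgt))
```

The last pair maps `'l'` to `'\n'`, which is how the network learns when a name should end.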

Add the columns `X` and `y` to the dataframe.

In [71]:
# TODO: Create the columns X and y
# names[1:] skips the header row '\tname\n'
X = [name[:-1] for name in names[1:]]  # drop the final '\n'
y = [name[1:] for name in names[1:]]   # drop the leading '\t'
X


['\taaliyah',
 '\taapeli',
 '\taapo',
 '\taaren',
 '\taarne',
 '\taarón',
 '\taaron',
 '\taatami',
 '\taatto',
 '\taatu',
 '\tabaddon',
 '\tabbán',
 '\tabbas',
 '\tabbey',
 '\tabbie',
 '\tabby',
 '\tabd-al-aziz',
 '\tabd-allah',
 '\tabd-al-malik',
 '\tabd-al-qadir',
 '\tabd-al-rahman',
 '\tabdul',
 '\tabdul-aziz',
 '\tabdullah',
 '\tabdul-rahman',
 '\tabe',
 '\tabednego',
 '\tabegail',
 '\tábel',
 '\tabel',
 '\tabelone',
 '\tabena',
 '\tabeni',
 '\tabhay',
 '\tabiah',
 '\tabidan',
 '\tabidemi',
 '\tabiel',
 '\tabigail',
 '\tabigayle',
 '\tabihu',
 '\tabijah',
 '\tabilene',
 '\tabimael',
 '\tabiram',
 '\tabisai',
 '\tabishag',
 '\tabishai',
 '\tabital',
 '\tabla',
 '\tabner',
 '\tabraham',
 '\tabram',
 '\tabsalom',
 '\tabsolon',
 '\tacacia',
 '\tacacius',
 '\tacantha',
 '\tace',
 '\tachaicus',
 '\tachan',
 '\tachieng',
 '\tachille',
 '\tachilles',
 '\tachim',
 '\tacke',
 '\tada',
 '\tadah',
 '\tadair',
 '\tadalbert',
 '\tadalberto',
 '\tadalheid',
 '\tadalia',
 '\tádám',
 '\tadam',
 '\t

Now, using your `char_to_idx` dict, compute the corresponding `X` and `y` containing, for each name, a list of integers.

In [48]:
# TODO: Create the X and y variables containing integers only
X_int = [[char_to_idx[n] for n in name] for name in X]
print(X_int)

y_int = [[char_to_idx[n] for n in name] for name in y]
print(y_int)

[[16, 3, 15, 7, 1], [3, 3, 14, 11, 27, 3, 10, 1], [3, 3, 18, 7, 14, 11, 1], [3, 3, 18, 17, 1], [3, 3, 20, 7, 16, 1], [3, 3, 20, 16, 7, 1], [3, 3, 20, 46, 16, 1], [3, 3, 20, 17, 16, 1], [3, 3, 22, 3, 15, 11, 1], [3, 3, 22, 22, 17, 1], [3, 3, 22, 23, 1], [3, 4, 3, 6, 6, 17, 16, 1], [3, 4, 4, 30, 16, 1], [3, 4, 4, 3, 21, 1], [3, 4, 4, 7, 27, 1], [3, 4, 4, 11, 7, 1], [3, 4, 4, 27, 1], [3, 4, 6, 2, 3, 14, 2, 3, 28, 11, 28, 1], [3, 4, 6, 2, 3, 14, 14, 3, 10, 1], [3, 4, 6, 2, 3, 14, 2, 15, 3, 14, 11, 13, 1], [3, 4, 6, 2, 3, 14, 2, 19, 3, 6, 11, 20, 1], [3, 4, 6, 2, 3, 14, 2, 20, 3, 10, 15, 3, 16, 1], [3, 4, 6, 23, 14, 1], [3, 4, 6, 23, 14, 2, 3, 28, 11, 28, 1], [3, 4, 6, 23, 14, 14, 3, 10, 1], [3, 4, 6, 23, 14, 2, 20, 3, 10, 15, 3, 16, 1], [3, 4, 7, 1], [3, 4, 7, 6, 16, 7, 9, 17, 1], [3, 4, 7, 9, 3, 11, 14, 1], [30, 4, 7, 14, 1], [3, 4, 7, 14, 1], [3, 4, 7, 14, 17, 16, 7, 1], [3, 4, 7, 16, 3, 1], [3, 4, 7, 16, 11, 1], [3, 4, 10, 3, 27, 1], [3, 4, 11, 3, 10, 1], [3, 4, 11, 6, 3, 16, 1], [3, 4,

That was complicated, but we are now back in a familiar situation: use Keras and its `pad_sequences()` function to get proper `X` and `y` variables with `maxlen=16`.

In [50]:
# TODO: Use pad_sequences to get only sequences of length 16 for each name
from tensorflow.keras.preprocessing import sequence

max_len_name = 16
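oov_token = 55 # take the first available index value!
## '\t' is at index 0, so do not use that index for oov_token!
## Caveat: to_categorical infers the class count as max index + 1, so padding with 55 yields 56 classes unless an existing index is used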

X_train = sequence.pad_sequences(X_int,
                                 value=oov_token,
                                 padding='post',
                                 maxlen=max_len_name)

y_train = sequence.pad_sequences(y_int,
                                 value=oov_token,
                                 padding='post',
                                 maxlen=max_len_name)
X_train.shape, y_train.shape

((11617, 16), (11617, 16))
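To make the effect of the padding explicit, here is a pure-Python sketch that mimics `pad_sequences(..., padding='post')` with its default `truncating='pre'`. The function `pad_post` is a hypothetical stand-in written for illustration, not the Keras implementation.

```python
def pad_post(sequences, maxlen, value):
    """Pad each sequence on the right up to maxlen (illustrative stand-in).

    Sequences longer than maxlen lose their first items, like the
    default truncating='pre' of pad_sequences.
    """
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                 # keep the last maxlen items
        missing = maxlen - len(seq)               # number of padding slots needed
        padded.append(seq + [value] * missing)    # padding goes at the end
    return padded


result = pad_post([[0, 3, 4], [0, 1, 2, 3, 4, 5]], maxlen=5, value=9)
print(result)
```

Note that a name longer than `maxlen` would lose its leading characters, start marker included, so it is worth checking the longest name in the dataset.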

Finally, using the function `to_categorical()`, make the one-hot-encoding needed.

In [51]:
# TODO: use to_categorical to perform one hot encoding
### CORRECTION: I only had 1h30 to do the challenges!!!
from tensorflow.keras.utils import to_categorical

X_train = to_categorical(X_train)
y_train = to_categorical(y_train)

X_train.shape, y_train.shape
## 55 = nb of unique characters in the reference dictionary

((11617, 16, 55), (11617, 16, 55))

In [65]:
X_train = np.array(X_train)
y_train = np.array(y_train)
X_train.shape, y_train.shape

((11617, 16, 55), (11617, 16, 55))

You should finally have arrays of shape `(number of names, 16, 55)`:
- `16` is the sequence length
- `55` is the number of possible characters
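The last dimension comes from the one-hot encoding: each integer becomes a vector with a single 1 at its own index. A minimal pure-Python sketch of that transformation (the helper `one_hot` is illustrative; `to_categorical` does the equivalent on whole arrays and, when `num_classes` is not given, infers it as the largest index plus one):

```python
def one_hot(indices, num_classes):
    """Turn a list of integer indices into a list of one-hot rows."""
    rows = []
    for idx in indices:
        row = [0.0] * num_classes  # all zeros...
        row[idx] = 1.0             # ...except at the character's index
        rows.append(row)
    return rows


encoded = one_hot([0, 3, 1], num_classes=4)
for row in encoded:
    print(row)
```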

Now you have to build a neural network. You can for example use one or two layers of GRU (or LSTM). Do not forget to set `return_sequences=True`. 

Then you will have to add a `TimeDistributed(Dense(55))` layer with a softmax activation function. This layer applies the same dense layer at each time step, giving a softmax prediction of the next character.

In [66]:
# TODO: Build the neural network
## see the course code from 11/06 - Advanced RNN
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import GRU, Dense, Input, TimeDistributed

## Many-to-many RNN => return_sequences=True on every GRU layer, including the last one
def gru(input_length: int, input_dim: int, embedding_dim: int, hidden_layer_size: tuple[int, ...]) -> Sequential:
    # embedding_dim is unused: the inputs are already one-hot encoded
    model = Sequential()

    # One-hot input: (time steps, number of possible characters)
    model.add(Input(shape=(input_length, input_dim)))

    # Stack the GRU layers; each one returns the full sequence
    for units in hidden_layer_size:
        model.add(GRU(units=units, return_sequences=True))

    # Finally, a softmax over the characters at each time step
    model.add(TimeDistributed(Dense(len(list_unique_chars), activation='softmax')))

    return model


### CORRECTION
def get_baby_name(): 
    model = Sequential()
    model.add(GRU(32, input_shape=(max_len_name, len(list_unique_chars)), return_sequences=True))
    model.add(GRU(32, return_sequences=True))
    model.add(TimeDistributed(Dense(len(list_unique_chars), activation='softmax')))
    return model

Finally, train your model!

In [67]:
# TODO: fit the model
##model = gru(input_length=len(X_train[0]), input_dim=10000, embedding_dim=32, hidden_layer_size=(8,))
model = get_baby_name()
model.summary()

Model: "sequential_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 gru_5 (GRU)                 (None, 16, 32)            8544      
                                                                 
 gru_6 (GRU)                 (None, 16, 32)            6336      
                                                                 
 time_distributed_3 (TimeDis  (None, 16, 55)           1815      
 tributed)                                                       
                                                                 
Total params: 16,695
Trainable params: 16,695
Non-trainable params: 0
_________________________________________________________________
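The parameter counts in this summary can be checked by hand. The sketch below assumes the default Keras GRU variant (`reset_after=True`), in which each of the three gates has an input kernel, a recurrent kernel, and two bias vectors. The helper names are illustrative.

```python
def gru_params(input_dim, units):
    """Parameter count of a GRU layer, assuming reset_after=True (two biases per gate)."""
    per_gate = units * (input_dim + units) + 2 * units
    return 3 * per_gate  # update gate, reset gate, candidate state


def dense_params(input_dim, units):
    """Weights plus one bias per output unit."""
    return input_dim * units + units


first_gru = gru_params(input_dim=55, units=32)     # one-hot characters in
second_gru = gru_params(input_dim=32, units=32)    # first GRU's outputs in
softmax = dense_params(input_dim=32, units=55)     # shared across time steps
total = first_gru + second_gru + softmax

print(first_gru, second_gru, softmax, total)
```

`TimeDistributed` adds no parameters of its own: the same dense weights are reused at all 16 time steps.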


In [68]:
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.fit(X_train, y_train, batch_size=64, epochs=50)

Epoch 1/50
Epoch 2/50

KeyboardInterrupt: 
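For reference, the `categorical_crossentropy` loss passed to `compile` reduces, at a single time step with a one-hot target, to minus the log of the probability the model assigns to the true character. A small pure-Python sketch of that arithmetic (the function `step_cross_entropy` is illustrative and not a Keras API; the clipping constant `eps` is an assumption made to avoid `log(0)`):

```python
import math


def step_cross_entropy(y_true, y_pred, eps=1e-7):
    """Cross-entropy between a one-hot target and a predicted distribution."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


# The true character is at index 1 and the model gives it probability 0.5
loss = step_cross_entropy([0.0, 1.0, 0.0], [0.25, 0.5, 0.25])
print(round(loss, 4))
```

Without masking, this quantity is averaged over all time steps, padded positions included, so part of what the model learns is simply to predict the padding.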

The final step will be to generate names, through a function `generate_names()`. 

To do so, you will have to give the output of the previous time step prediction as input to the next time step.

You will have to use the method `predict_proba` of your model (it was removed from recent TensorFlow releases, where `predict` returns the same probabilities), as well as the function `numpy.random.choice`.
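The sampling loop itself can be sketched independently of Keras. In the code below, `next_char_probs` is a hypothetical callable that takes the indices generated so far and returns a probability for each character. In the notebook it would wrap the padding, the one-hot encoding, and the model prediction at the current time step. `random.choices` from the standard library stands in for `numpy.random.choice`, and the stub model is made up so that the example runs on its own.

```python
import random


def generate_name(next_char_probs, idx_to_char, start_idx, end_idx, max_len=16, rng=None):
    """Sample one name character by character (illustrative sketch)."""
    rng = rng or random.Random()
    indices = [start_idx]                      # always begin with the start marker
    while len(indices) < max_len:
        probs = next_char_probs(indices)       # distribution over the next character
        next_idx = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        if next_idx == end_idx:                # the end marker closes the name
            break
        indices.append(next_idx)               # feed the prediction back as input
    return ''.join(idx_to_char[i] for i in indices[1:])


# Toy vocabulary and a deterministic stub standing in for the trained model
idx_to_char = {0: '\t', 1: '\n', 2: 'a', 3: 'b'}
transitions = {0: 2, 2: 3, 3: 1}               # '\t' -> 'a' -> 'b' -> '\n'


def stub_probs(indices):
    probs = [0.0] * len(idx_to_char)
    probs[transitions[indices[-1]]] = 1.0      # all the mass on one character
    return probs


generated = generate_name(stub_probs, idx_to_char, start_idx=0, end_idx=1)
print(generated)
```

Sampling from the distribution, rather than always taking the most likely character, is what makes successive calls produce different names with a real model.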

Finally, use your function to generate some names!

In [11]:
# TODO: implement the function generate_names


In case this looks too complicated (indeed it is far from being simple), you can use the function `generate_n_names()` in the file `generate.py`. But first have a look at it and try to understand what it does!

If you have more time, you can try to improve the results by tuning your neural network hyperparameters.

You can also use the original file, `Prenoms.csv`, and keep only the names from a given origin, for example to build a more specific model.

**Conclusion**: This method can be applied to almost any kind of sequence: you can generate music, Shakespeare-style text, lyrics... All it takes is to change the data preprocessing and adapt the dimensions.