# Girl Name Generation with NumPy and Basic RNN

Imagine a young couple expecting the birth of their first child. They soon find out that the baby will be a girl, but they are not sure yet what to name her. The purpose of this project is to generate girl name recommendations with the help of a simple character-level RNN architecture.

Because we will work with a low-level implementation in NumPy, a very simple one-layer RNN architecture is used to keep the algorithm from getting overly complex. A text file containing about 3000 girl names is provided; it can be found in this GitHub repo: https://github.com/dominictarr/random-name/blob/master/first-names.txt.

Before we jump into the RNN architecture, let's import all necessary libraries for this project.

In [1]:
import numpy as np
import random

The first thing we need to do is preprocess the text data. First, we read the text file. Second, we convert all of the text to lowercase for consistency. Third, we create a list of the unique characters found in the text data.

Let's do all of the steps explained above.

In [2]:
with open('girl_name.txt', 'r') as f:  # Close the file automatically after reading
    data = f.read()
data = data.lower()  # Transform all characters into lowercase
chars = list(set(data))  # Create a list of the unique characters
data_size, vocab_size = len(data), len(chars)

In [3]:
print('There are '+str(vocab_size)+' unique characters in your text data')

There are 29 unique characters in your text data


As we can see from the output above, the text data contains 29 unique characters. Let's see what those unique characters are.

In [4]:
chars = sorted(chars)
print(chars)

['\n', ' ', '-', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


From the output above, we can see that there are three unusual characters: "\n", " ", and "-". The characters " " and "-" are special characters that normally appear in certain people's names, for example Mary-Anne or van Domersmack. Meanwhile, "\n" marks the end of a name and plays a role similar to an EOS (End-of-Sentence) token.

Next, we need to create two dictionaries that map each character to its corresponding index and vice versa.

In [5]:
char_to_idx = {ch: i for i, ch in enumerate(chars)}  # character -> index
idx_to_char = {i: ch for i, ch in enumerate(chars)}  # index -> character
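
To make the two mappings concrete, here is a minimal sketch that encodes a name into indices and decodes it back. The mini-vocabulary and the helper names `encode` and `decode` are hypothetical and used only for illustration; the real vocabulary is built from `girl_name.txt`.

```python
# Hypothetical mini-corpus for illustration; the real one comes from girl_name.txt.
sample_text = "anna\nmary-anne\n"
sample_chars = sorted(set(sample_text.lower()))

sample_char_to_idx = {ch: i for i, ch in enumerate(sample_chars)}
sample_idx_to_char = {i: ch for i, ch in enumerate(sample_chars)}

def encode(name, char_to_idx):
    """Turn a string into a list of integer indices."""
    return [char_to_idx[ch] for ch in name]

def decode(indices, idx_to_char):
    """Turn a list of integer indices back into a string."""
    return "".join(idx_to_char[i] for i in indices)

encoded = encode("anna", sample_char_to_idx)
decoded = decode(encoded, sample_idx_to_char)

print(sample_chars)      # sorted unique characters of the mini-corpus
print(encoded, decoded)  # indices for "anna" and the round-tripped string
```

Because both dictionaries are built from the same sorted list, decoding an encoded name always returns the original string.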

## RNN Architecture

Now it's time to build the simple Recurrent Neural Network (RNN) model. The figure below shows the architecture of the RNN model in this project:

<img src="rnn.png" style="width:600 ; height:220px;">

In this project, there is only one simple RNN layer, as shown in the figure above. The steps to compute the simple RNN model are as follows:

- At the first time step, the input character $x^{\langle 1 \rangle}$ and the initial hidden state $a^{\langle 0 \rangle}$ are both set to zero vectors.
- Next, we run one step of forward propagation to get the hidden state at the next time step, $a^{\langle 1 \rangle}$, and the predicted character distribution at that time step, $\hat{y}^{\langle 1 \rangle}$.

The hidden state, the output pre-activation (logits), and the prediction are computed as follows:

Hidden state:
$$ a^{\langle t+1 \rangle} = \tanh(W_{ax}  x^{\langle t+1 \rangle } + W_{aa} a^{\langle t \rangle } + b)$$
Output pre-activation (logits):
$$ z^{\langle t + 1 \rangle } = W_{ya}  a^{\langle t + 1 \rangle } + b_y $$
Prediction:
$$ \hat{y}^{\langle t+1 \rangle } = \text{softmax}(z^{\langle t + 1 \rangle })$$

Before we build a function to do all of these steps, first let's define a function to compute the softmax activation function.

In [6]:
def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)
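
The subtraction of `np.max(x)` in `softmax` does not change the result, but it keeps `np.exp` from overflowing. The sketch below illustrates this with a hypothetical unshifted variant, `naive_softmax`, which is not part of this project and is shown only for comparison.

```python
import numpy as np

def softmax(x):
    # Same definition as above: shift by the max before exponentiating.
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)

def naive_softmax(x):
    # Hypothetical unshifted variant, shown only for comparison.
    e_x = np.exp(x)
    return e_x / e_x.sum(axis=0)

small = np.array([[1.0], [2.0], [3.0]])
large = small + 1000.0  # same differences between entries, much larger magnitude

# For moderate inputs both versions agree.
agree_on_small = np.allclose(softmax(small), naive_softmax(small))

# Adding a constant to every entry leaves the shifted softmax unchanged.
shift_invariant = np.allclose(softmax(small), softmax(large))

# The naive version computes inf / inf = nan on the large inputs.
with np.errstate(over="ignore", invalid="ignore"):
    naive_breaks = bool(np.isnan(naive_softmax(large)).all())

print(agree_on_small, shift_invariant, naive_breaks)
```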

Next, we can define a function that applies these steps repeatedly to sample a name one character at a time.

In [7]:
def RNN_cell(parameters, char_to_idx, seed):
    
    Waa, Wax, Wya, by, b = parameters['Waa'], parameters['Wax'], parameters['Wya'], parameters['by'], parameters['b']
    vocab_size = by.shape[0]
    n_a = Waa.shape[1]
    
    # Creating a zero vector x that can be used as the one-hot vector 
    # representing the first character. 
    x = np.zeros((vocab_size,1))
    
    # Initialize hidden state at previous time step as zeros.
    a_prev = np.zeros((n_a,1))
   
    # Create an empty list of indices.
    indices = []
    
    # idx is the index of the one-hot vector x that is set to 1. In this case, initialize idx to -1.
    idx = -1 
    
    counter = 0
    newline_character = char_to_idx['\n']
    
    while (idx != newline_character and counter != 50): # cap each generated name at 50 characters.
        
        # Forward propagate x.
        a = np.tanh(np.dot(Wax,x)+np.dot(Waa,a_prev)+b)
        z = np.dot(Wya,a)+by
        y = softmax(z)
        
        # Sample the index of a character within the vocabulary from the probability distribution y
        # so that it will not always generate the same character given a previous character.
        idx = np.random.choice(vocab_size, p = y.ravel())

        # Append the index to "indices"
        indices.append(idx)
        
        # Overwrite the input x with one that corresponds to the sampled index `idx`.
        x = np.zeros((vocab_size,1))
        x[idx] = 1
        
        # Update the hidden state to current time steps.
        a_prev = a
        
        counter +=1

    if (counter == 50):
        indices.append(char_to_idx['\n'])
    
    return indices
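
The reason `RNN_cell` samples from the distribution instead of always taking the most likely character can be seen in a small sketch. The four-character vocabulary and its probabilities below are made up for illustration, and the seeded `np.random.default_rng` generator is an assumption made for reproducibility (the function above uses `np.random.choice` directly).

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator, assumed here for reproducibility

# Hypothetical next-character distribution over a four-character vocabulary.
vocab = ['\n', 'a', 'e', 'n']
y = np.array([[0.1], [0.5], [0.3], [0.1]])

# Greedy decoding: always picks the same, most likely character.
greedy_idx = int(np.argmax(y))

# Sampling: draws an index in proportion to its probability, as RNN_cell does.
sampled = rng.choice(len(vocab), size=10_000, p=y.ravel())
frequencies = np.bincount(sampled, minlength=len(vocab)) / sampled.size

print(vocab[greedy_idx])  # the greedy choice never varies
print(frequencies)        # empirical frequencies, close to the probabilities in y
```

With greedy decoding every generated name would be identical for a given set of parameters, while sampling produces a different name on each call.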

Next, we need to initialize all of the weight and bias parameters that will be updated at each iteration of the optimization process.

In [8]:
def initialize_parameters(n_a, n_x, n_y):
    
    Wax = np.random.randn(n_a, n_x)*0.01  # input to hidden
    Waa = np.random.randn(n_a, n_a)*0.01  # hidden to hidden
    Wya = np.random.randn(n_y, n_a)*0.01  # hidden to output
    b = np.zeros((n_a, 1)) # hidden bias
    by = np.zeros((n_y, 1)) # output bias
    
    parameters = {"Wax": Wax, "Waa": Waa, "Wya": Wya, "b": b,"by": by}
    
    return parameters
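
As a quick sanity check on the shapes, the sketch below counts the trainable parameters for the sizes used later in this project (`n_a = 50` hidden units and a vocabulary of 29 characters). The dictionary `shapes` is only an illustration that mirrors the shapes in `initialize_parameters`.

```python
n_a, n_x, n_y = 50, 29, 29  # hidden units, input size, output size

# Same shapes as in initialize_parameters.
shapes = {
    "Wax": (n_a, n_x),  # input to hidden
    "Waa": (n_a, n_a),  # hidden to hidden
    "Wya": (n_y, n_a),  # hidden to output
    "b": (n_a, 1),      # hidden bias
    "by": (n_y, 1),     # output bias
}

param_counts = {name: rows * cols for name, (rows, cols) in shapes.items()}
total_params = sum(param_counts.values())

print(param_counts)
print(total_params)  # 5479
```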

## Forward Propagation

Next, it's time to forward propagate the RNN model. The forward propagation steps are exactly the same as the ones already used inside the `RNN_cell` function. First, the input vector $x$ and the previous hidden state $a$ are multiplied by their corresponding weights. Then the bias term is added and the tanh activation function is applied to add non-linearity to the model. Finally, the softmax activation function turns the output into a probability distribution over the next character for that time step.

In [9]:
def rnn_step_forward(parameters, a_prev, x):
    
    Waa, Wax, Wya, by, b = parameters['Waa'], parameters['Wax'], parameters['Wya'], parameters['by'], parameters['b']
    a_next = np.tanh(np.dot(Wax, x) + np.dot(Waa, a_prev) + b) # hidden state
    p_t = softmax(np.dot(Wya, a_next) + by) # probabilities for the next character
    
    return a_next, p_t

The function above performs forward propagation for one RNN cell. Next, let's define a function that performs those operations at every time step.

In [10]:
def rnn_forward(X, Y, a0, parameters, vocab_size = 29):
    
    # Initialize x, a and y_hat as empty dictionaries
    x, a, y_hat = {}, {}, {}
    
    a[-1] = np.copy(a0)
    
    # initialize loss to 0
    loss = 0
    
    for t in range(len(X)): #iterate every time step
        
        # Set x[t] to be the one-hot vector representation of the t'th character in X.
        # If X[t] is None, x[t] stays all zeros. This is used to set the input for the first time step to the zero vector.
        x[t] = np.zeros((vocab_size,1)) 
        if X[t] is not None:
            x[t][X[t]] = 1
        
        # Run one step forward of the RNN
        a[t], y_hat[t] = rnn_step_forward(parameters, a[t-1], x[t])
        
        # Update the loss by subtracting the log-probability of the correct character (the cross-entropy term) at this time step.
        loss -= np.log(y_hat[t][Y[t],0])
    
    # Store the result in a cache which will be useful for backpropagation
    cache = (y_hat, a, x)
        
    return loss, cache
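
A useful reference point for the loss: if the model predicted a perfectly uniform distribution at every time step, each character would contribute $\ln(29) \approx 3.37$ to the loss. The sketch below checks this with a hypothetical target sequence (the indices of "anna" followed by `'\n'` in the 29-character vocabulary printed earlier). Since the weights are initialized close to zero, the loss at the start of training should be near this value per character.

```python
import numpy as np

vocab_size = 29
Y = [3, 16, 16, 3, 0]  # hypothetical targets: 'a', 'n', 'n', 'a', '\n'

# A perfectly uniform prediction, reused at every time step.
y_hat_uniform = np.full((vocab_size, 1), 1.0 / vocab_size)

# Same accumulation as in rnn_forward.
loss = 0.0
for t in range(len(Y)):
    loss -= np.log(y_hat_uniform[Y[t], 0])

expected = len(Y) * np.log(vocab_size)
print(loss, expected)  # both equal 5 * ln(29)
```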

The functions for forward propagation are now defined. Next, let's define the functions for the backpropagation algorithm.

## Backpropagation Algorithm

Backpropagation is often regarded as the hardest part of a deep learning model to implement by hand, and RNNs are no exception: the derivation is lengthy and not very intuitive for many people. Luckily, for a basic RNN with a softmax output and cross-entropy loss, the gradients have simple closed-form expressions, so we only need to translate those formulas into code.

Let's define a function to compute the backward propagation.

In [11]:
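def rnn_step_backward(dy, gradients, parameters, x, a, a_prev):
    
    # Gradients of the output layer
    gradients['dWya'] += np.dot(dy, a.T)
    gradients['dby'] += dy
    # Gradient w.r.t. the hidden state: from the output at this step plus from the next time step
    da = np.dot(parameters['Wya'].T, dy) + gradients['da_next'] 
    # Backpropagate through tanh, whose derivative is 1 - a^2
    daraw = (1 - a * a) * da 
    gradients['db'] += daraw
    gradients['dWax'] += np.dot(daraw, x.T)
    gradients['dWaa'] += np.dot(daraw, a_prev.T)
    # Gradient passed back to the previous time step
    gradients['da_next'] = np.dot(parameters['Waa'].T, daraw)
    
    return gradients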

The function defined above applies the backpropagation algorithm to one time step only. Let's define a function that runs backpropagation through every time step.

In [12]:
def rnn_backward(X, Y, parameters, cache):
    
    # Initialize gradients as an empty dictionary
    gradients = {}
    
    # Retrieve from cache and parameters
    (y_hat, a, x) = cache
    Waa, Wax, Wya, by, b = parameters['Waa'], parameters['Wax'], parameters['Wya'], parameters['by'], parameters['b']
    
    # each one should be initialized to zeros of the same dimension as its corresponding parameter
    gradients['dWax'], gradients['dWaa'], gradients['dWya'] = np.zeros_like(Wax), np.zeros_like(Waa), np.zeros_like(Wya)
    gradients['db'], gradients['dby'] = np.zeros_like(b), np.zeros_like(by)
    gradients['da_next'] = np.zeros_like(a[0])
    

    # Backpropagate through time
    for t in reversed(range(len(X))):
        dy = np.copy(y_hat[t])
        dy[Y[t]] -= 1
        gradients = rnn_step_backward(dy, gradients, parameters, x[t], a[t], a[t-1])

    
    return gradients, a
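
The two lines `dy = np.copy(y_hat[t])` and `dy[Y[t]] -= 1` rely on the fact that, for a softmax output with cross-entropy loss, the gradient with respect to the logits is the predicted distribution minus the one-hot target. The sketch below checks this numerically with central finite differences; the logits, target index, and helper `cross_entropy` are hypothetical and used only for this check.

```python
import numpy as np

def softmax(z):
    e_z = np.exp(z - np.max(z))
    return e_z / e_z.sum(axis=0)

def cross_entropy(z, target):
    # Loss of a single time step as a function of the logits z.
    return -np.log(softmax(z)[target, 0])

rng = np.random.default_rng(1)
z = rng.standard_normal((5, 1))  # hypothetical logits for a 5-character vocabulary
target = 2                       # hypothetical index of the correct character

# Analytic gradient, written the same way as in rnn_backward.
dy = np.copy(softmax(z))
dy[target] -= 1

# Central finite differences, one logit at a time.
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(z.shape[0]):
    step = np.zeros_like(z)
    step[i] = eps
    numeric[i] = (cross_entropy(z + step, target) - cross_entropy(z - step, target)) / (2 * eps)

max_abs_diff = float(np.max(np.abs(dy - numeric)))
print(max_abs_diff)  # should be tiny if the closed form is right
```

The same kind of check can be applied to the full `rnn_backward` function by perturbing individual weights, although that is slower.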

## Parameters Update, Gradient Clipping, and Optimization

After applying the backpropagation algorithm at every time step, it is time to use the gradients of the weight and bias parameters to update their corresponding variables. The update is straightforward: we subtract from each parameter its gradient multiplied by the learning rate $\alpha$. Below are the formulas for the parameter update:

$$ W^{[t]} = W^{[t]} - \alpha \text{ } dW^{[t]}$$
$$ b^{[t]} = b^{[t]} - \alpha \text{ } db^{[t]}$$

Let's define a function to do this operation.

In [13]:
def update_parameters(parameters, gradients, lr):

    parameters['Wax'] += -lr * gradients['dWax']
    parameters['Waa'] += -lr * gradients['dWaa']
    parameters['Wya'] += -lr * gradients['dWya']
    parameters['b']  += -lr * gradients['db']
    parameters['by']  += -lr * gradients['dby']
    return parameters

We have now defined all the functions needed for the optimization process. However, simple RNN architectures are well known to suffer from gradient problems, namely vanishing gradients or exploding gradients.

One technique for dealing with exploding gradients in a simple RNN is gradient clipping. With gradient clipping, any gradient entry that is larger than a predefined maximum or smaller than a predefined minimum is set to that maximum or minimum, which keeps the update from blowing up. Let's define a function for gradient clipping.

In [14]:
def gradientClip(gradients, max_value):
    
    dWaa, dWax, dWya, db, dby = gradients['dWaa'], gradients['dWax'], gradients['dWya'], gradients['db'], gradients['dby']
   
    for gradient in [dWax, dWaa, dWya, db, dby]:
        
        np.clip(gradient, -max_value, max_value, out = gradient)
    
    gradients = {"dWaa": dWaa, "dWax": dWax, "dWya": dWya, "db": db, "dby": dby}
    
    return gradients
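
To see what element-wise clipping does, here is a small sketch on a made-up gradient matrix. It also shows that `np.clip` with `out=` modifies the array in place, which is how `gradientClip` updates the gradients.

```python
import numpy as np

# Hypothetical gradient with two entries outside the range [-5, 5].
grad = np.array([[-12.0, 0.5],
                 [3.0, 40.0]])

# Without `out`, np.clip returns a new array and leaves `grad` untouched.
clipped_copy = np.clip(grad, -5, 5)
print(clipped_copy)

# With `out=grad`, the original array itself is overwritten.
np.clip(grad, -5, 5, out=grad)
same_after_inplace = np.array_equal(grad, clipped_copy)
print(same_after_inplace)
```

Entries already inside the range are left unchanged; only the entries beyond the limits are moved to the nearest limit.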

Finally, we can define a function that runs one full optimization step. This function wraps all of the steps defined above: forward propagation, backpropagation, gradient clipping, and the parameter update. Let's define this function.

In [15]:
def optimize(X, Y, a_prev, parameters, learning_rate = 0.01):
    
    # Forward propagate through time
    loss, cache = rnn_forward(X, Y, a_prev, parameters)
    
    # Backpropagate through time
    gradients, a = rnn_backward(X, Y, parameters, cache)
    
    # Clip gradients between -5 (min) and 5 (max)
    gradients = gradientClip(gradients, 5)
    
    # Update parameter
    parameters = update_parameters(parameters, gradients, learning_rate)
    
    return loss, gradients, a[len(X)-1]

## Build the Model

So far, we have defined the wrapper function for the optimization process. However, that function runs only a single optimization step on one training example (the code below calls these steps `epochs`). Hence, we need to build a model that calls the optimization function for the number of steps that we define in advance.

Before we build the model that wraps up the whole process, let's define a print function so that we can see the girl names generated at the first and last iterations of training.

In [16]:
def print_sample(sample_ix, ix_to_char):
    txt = ''.join(ix_to_char[ix] for ix in sample_ix)
    txt = txt[0].upper() + txt[1:]  # capitalize first character 
    print ('%s' % (txt, ), end='')

Finally, we can build the final wrapper model, which runs the optimization process for the number of epochs that we specify in advance.

In [27]:
def model(data, idx_to_char, char_to_idx, epochs = 150001, n_a = 50, girl_names = 8, vocab_size = 29):
    
    # Retrieve n_x and n_y from vocab_size
    n_x, n_y = vocab_size, vocab_size
    
    # Initialize parameters
    parameters = initialize_parameters(n_a, n_x, n_y)
    
    # Build list of all girl names.
    with open("girl_name.txt") as f:
        examples = f.readlines()
    examples = [x.lower().strip() for x in examples]
    
    # Shuffle list of all girl names
    np.random.seed(0)
    np.random.shuffle(examples)
    
    # Initialize the hidden state
    a_prev = np.zeros((n_a, 1))
    
    # Optimization loop
    for j in range(epochs):
       
        # Set the index `idx`
        idx = j%len(examples)
        
        # Set the input X
        single_example = examples[idx]
        single_example_chars = [c for c in single_example]
        
        single_example_idx =  [char_to_idx[c] for c in single_example_chars]
       
        X = [None] + single_example_idx
        
        # Set the labels Y
        idx_newline = char_to_idx['\n']
        Y = single_example_idx+[idx_newline]
        
        
        # Perform one optimization step: Forward-prop -> Backward-prop -> Clip -> Update parameters
        
        loss, gradients, a_prev = optimize(X, Y, a_prev, parameters, learning_rate = 0.01)

        # Print output
        if j % (epochs-1) == 0:
            
            # The number of girl names to print
            for name in range(girl_names):
                
                # Sample indices and print them
                sampled_indices = RNN_cell(parameters, char_to_idx, 0)
                print_sample(sampled_indices, idx_to_char) 
      
            print('\n')
        
    return parameters
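
Inside the loop above, the inputs `X` and the labels `Y` are the same name shifted by one position: at each time step the model sees one character and is asked to predict the next one. The sketch below prints these pairs for a hypothetical example name, using the 29-character vocabulary printed earlier in this notebook.

```python
# Vocabulary as printed earlier: '\n', ' ', '-', then 'a' to 'z'.
chars = ['\n', ' ', '-'] + [chr(c) for c in range(ord('a'), ord('z') + 1)]
char_to_idx = {ch: i for i, ch in enumerate(chars)}

name = "anna"  # hypothetical training example
name_idx = [char_to_idx[ch] for ch in name]

X = [None] + name_idx               # inputs: zero vector first, then each letter
Y = name_idx + [char_to_idx['\n']]  # targets: each letter, then the end marker

# Show which target each input is paired with.
for x_t, y_t in zip(X, Y):
    shown_input = '<zero>' if x_t is None else chars[x_t]
    print(repr(shown_input), '->', repr(chars[y_t]))
```

The leading `None` becomes the all-zero input vector in `rnn_forward`, and the trailing `'\n'` teaches the model when a name should end.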

## Generate Girl Name Recommendations

Now it's time to generate the girl name recommendations. To do this, all we need to do is call the `model` function defined above and pass in the text data as well as the character-to-index and index-to-character mappings that we defined at the very beginning.

In [28]:
parameters = model(data, idx_to_char, char_to_idx)

Aodgiayzjeyurkitmqnr-hbdcl
Ym qeavca sp
Drbndkwfqj-hvzvbhvntt-gkyxj-imxffplcblttzii
Vptkx
Ah--ndpxue-yzhkbrjun guqpqopomshkidg gi-dedhbsgaoc
 bjrp
Tgq
I


Leri
Shilly
Shaysenannne
Chirelia
Dido
Kathy
Kaunan
Lisle




And there we have different girl name recommendations! As we can see from the results above, at the first iteration the algorithm generates a set of words that don't make sense at all. However, with the help of a basic character-level RNN architecture and gradient descent optimization, the algorithm learns the patterns of various girl names from the input data and, by the last iteration, generates new girl names for us.