# Predicting Gender from the First Name with an LSTM using Keras
In this short article, I want to go over how we can use a very basic LSTM to predict the gender from a given first name. Since there are many great courses on the math and general concepts behind Recurrent Neural Networks (RNNs), e.g. [Andrew Ng's deep learning specialization](https://www.coursera.org/specializations/deep-learning) or on [Medium](https://medium.com/search?q=lstm), I will not dig deeper into them and will take this knowledge as given. Instead, we will focus only on the high-level implementation using Keras. The goal is to get a more practical understanding of the decisions one has to make when building a neural network like this, especially how to choose some of the hyperparameters.

On Keras: At the latest since it gained TensorFlow support in 2017, [Keras](https://keras.io/) has made a huge splash as a relatively easy-to-use and intuitive interface to more complex machine learning libraries. As you will see, building the actual neural network and training the model are going to be the shortest parts of our script.

The first step is to determine the type of network we want to use, since that decision can impact our data preparation process. In a name, the order of characters matters, which means that, if we want to analyze a name using a neural network, an RNN is the logical choice. Long Short-Term Memory (LSTM) networks, a special form of RNN, are especially powerful at finding the right features when the chain of input chunks becomes longer. The input in our case is always a string (the name) and the output is a 1x2 vector indicating whether the name belongs to a male or a female person.

After making this decision, we will start by loading all the packages that we need as well as the dataset: a file containing over 1.5 million German users with their name and gender (encoded as <i>f</i> for female and <i>m</i> for male).

In [74]:
import pandas as pd
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout, LSTM

filepath = 'd:/AS_Data/temp/name_test.csv'
max_rows = 500000 # Reduction due to memory limitations

df = (pd.read_csv(filepath, usecols=['first_name', 'gender'])
        .dropna(subset=['first_name', 'gender'])
        .assign(first_name = lambda x: x.first_name.str.strip())
        .head(max_rows))

# In the case of a middle name, we will simply use the first name only
df['first_firstname'] = df['first_name'].apply(lambda x: str(x).split(' ', 1)[0])

# Sometimes people only put the first letter of their name into the field, so we drop all names where len < 3
df.drop(df[df['first_firstname'].str.len() < 3].index, inplace=True)

## Preprocessing the data
The next step in any natural language processing task is to convert the input into a machine-readable vector format. In theory, neural networks in Keras are able to handle inputs of variable shape. In practice, however, working with a fixed input length in Keras can improve performance noticeably, especially during training. The reason is that inputs of equal length can be stacked into tensors of a fixed shape, which can then be processed efficiently in batches.

First, we will convert every (first) name into a vector. The method we'll be using is the so-called one-hot encoding.
Here, every character is represented by a binary vector of length n, where n is the number of different characters in the alphabet (26 for the English alphabet), and a word becomes a sequence of these vectors. The reason why we cannot simply convert every character to its position in the alphabet (e.g. a - 1, b - 2, etc.) is that this would lead the network to assume that the characters are on an ordinal scale instead of a categorical one: z is not "worth more" than a.

Example:<br>
s becomes:<br>[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

_hello_ becomes:<br>
[[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],<br>
  [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],<br>
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],<br>
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],<br>
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
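
The example above can be reproduced with a few lines of NumPy. This is only a minimal sketch: it assumes a plain 26-letter lowercase alphabet indexed from 0, so each row has 26 entries, and the names `alphabet`, `char_to_index` and `one_hot_word` are made up for illustration.

```python
import string

import numpy as np

# Assumption: a plain 26-letter lowercase alphabet, indexed from 0 (a -> 0, z -> 25)
alphabet = string.ascii_lowercase
char_to_index = {char: index for index, char in enumerate(alphabet)}


def one_hot_word(word):
    """Return a (len(word), len(alphabet)) matrix with a single 1 per row."""
    indices = [char_to_index[char] for char in word.lower()]
    # Row i of the identity matrix is exactly the one-hot vector for index i
    return np.eye(len(alphabet), dtype=int)[indices]


encoded = one_hot_word("hello")
print(encoded.shape)                    # (5, 26)
print(encoded.argmax(axis=1).tolist())  # [7, 4, 11, 11, 14]
```

The positions of the ones (7, 4, 11, 11, 14) match the rows shown for _hello_ above.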

Now that we have determined what the input has to look like, we have two decisions to make: how long the character vector should be (how many different characters we allow) and how long the name vector should be (how many characters we want to look at). We will only allow the most common characters in the German alphabet (standard Latin + öäü) and the hyphen, which is part of many older names.
For simplicity, we will set the length of the name vector to the length of the longest name in our dataset, but with 25 as an upper bound to make sure our input vector doesn't grow too large just because one person made a mistake during the name entry process.

In [64]:
# Parameters
predictor_col = 'first_firstname'
result_col = 'gender'

accepted_chars = 'abcdefghijklmnopqrstuvwxyzöäü-'

word_vec_length = min(df[predictor_col].apply(len).max(), 25) # Length of the input vector
char_vec_length = len(accepted_chars) # Length of the character vector
output_labels = 2 # Number of output labels

print(f"The input vector will have the shape {word_vec_length}x{char_vec_length}.")

The input vector will have the shape 23x30.


Scikit-learn already includes a one-hot encoding algorithm in its preprocessing library. However, because we are not converting labels into vectors but splitting every string into its characters, writing a custom algorithm seemed quicker than the preprocessing that would otherwise be needed.

[It has been shown](https://www.draketo.de/english/python-memory-numpy-list-array) that NumPy arrays need around 4 times less memory than Python lists. For that reason, we use a list comprehension as a more Pythonic way of creating the input array, but already convert every word vector into an array inside the list. When working with NumPy arrays, we have to make sure that all lists and/or arrays that get combined have the same shape.
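
To get a rough feeling for the difference, the following sketch compares the container overhead of nested Python lists with the buffer size of a NumPy array for one encoded name. The 23x30 shape comes from the output above; the `uint8` dtype is an assumption of this sketch (the article's code keeps NumPy's default), and the exact list sizes depend on the Python version.

```python
import sys

import numpy as np

rows, cols = 23, 30  # shape of one encoded name, taken from the output above

# Nested Python lists: each row is its own list object holding references to ints
as_lists = [[0 for _ in range(cols)] for _ in range(rows)]
list_bytes = sys.getsizeof(as_lists) + sum(sys.getsizeof(row) for row in as_lists)

# NumPy arrays store the raw values in one contiguous buffer
as_float64 = np.zeros((rows, cols))                # default dtype, 8 bytes per value
as_uint8 = np.zeros((rows, cols), dtype=np.uint8)  # 1 byte per value is enough for 0/1

print(f"nested lists (containers only): {list_bytes} bytes")
print(f"float64 array buffer:           {as_float64.nbytes} bytes")
print(f"uint8 array buffer:             {as_uint8.nbytes} bytes")
```

For a dataset of several hundred thousand names, picking a small integer dtype could therefore reduce the memory footprint further, at the cost of an explicit cast.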

In [58]:
# Define a mapping of chars to integers
char_to_int = dict((c, i) for i, c in enumerate(accepted_chars))
int_to_char = dict((i, c) for i, c in enumerate(accepted_chars))

# Removes all non accepted characters
def normalize(line):
    return [c.lower() for c in line if c.lower() in accepted_chars]

# Returns a list of n lists with n = word_vec_length
def name_encoding(name):

    # Encode input data to int, e.g. a->0, z->25
    integer_encoded = [char_to_int[char] for i, char in enumerate(name) if i < word_vec_length]
    
    # Start one-hot-encoding
    onehot_encoded = list()
    
    for value in integer_encoded:
        # create a list of n zeros, where n is equal to the number of accepted characters
        letter = [0 for _ in range(char_vec_length)]
        letter[value] = 1
        onehot_encoded.append(letter)
        
    # Fill up the list to the max length. All lists need to have equal length to be convertible into an array
    for _ in range(word_vec_length - len(name)):
        onehot_encoded.append([0 for _ in range(char_vec_length)])
        
    return onehot_encoded

# Encode the output labels
def lable_encoding(gender_series):
    # Build the one-hot labels in a plain list first; calling np.append in a loop
    # would copy the whole array on every iteration
    labels = [[1, 0] if gender == 'm' else [0, 1] for gender in gender_series]
    return np.array(labels, dtype=float).reshape(-1, 2)
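
The `int_to_char` mapping defined above is not needed for the encoding itself, but it is handy for a sanity check: decoding an encoded name should give back the normalized input. The following is a self-contained sketch, where `encode` is a compact stand-in for `name_encoding`, `decode` is a hypothetical helper, and `max_length = 23` stands in for `word_vec_length`.

```python
import numpy as np

accepted_chars = 'abcdefghijklmnopqrstuvwxyzöäü-'
int_to_char = dict(enumerate(accepted_chars))
char_to_int = {char: index for index, char in int_to_char.items()}
max_length = 23  # stands in for word_vec_length


def encode(name):
    """Compact stand-in for name_encoding: one row per character, zero rows as padding."""
    matrix = np.zeros((max_length, len(accepted_chars)), dtype=int)
    kept = [char for char in name.lower() if char in accepted_chars][:max_length]
    for position, char in enumerate(kept):
        matrix[position, char_to_int[char]] = 1
    return matrix


def decode(matrix):
    """Hypothetical helper: turn a one-hot matrix back into a string, skipping padding rows."""
    matrix = np.asarray(matrix)
    return ''.join(int_to_char[int(row.argmax())] for row in matrix if row.any())


print(decode(encode('Anna-Lena')))  # anna-lena
print(decode(encode('Jürgen')))     # jürgen
```

A round trip like this makes it easy to spot off-by-one errors in the character mapping or in the padding.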

In [None]:
# Split the dataset into 60% train, 20% validation and 20% test
train, validate, test = np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))])

# Convert both the input names and the output labels into the discussed machine-readable vector format
train_x =  np.asarray([np.asarray(name_encoding(normalize(name))) for name in train[predictor_col]])
train_y = lable_encoding(train.gender)

validate_x = np.asarray([name_encoding(normalize(name)) for name in validate[predictor_col]])
validate_y = lable_encoding(validate.gender)

test_x = np.asarray([name_encoding(normalize(name)) for name in test[predictor_col]])
test_y = lable_encoding(test.gender)
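
The 60/20/20 split can also be written with plain positional slicing, which may be preferable on newer pandas versions where passing a DataFrame to `np.split` can trigger a deprecation warning. A minimal sketch on toy data (the frame `toy` and the `_part` names are made up for illustration):

```python
import pandas as pd

# Toy data standing in for the real dataset (hypothetical values)
toy = pd.DataFrame({'first_firstname': [f'name{i}' for i in range(10)],
                    'gender': ['m', 'f'] * 5})

shuffled = toy.sample(frac=1, random_state=0)  # random_state only for reproducibility

n = len(shuffled)
train_end, validate_end = int(.6 * n), int(.8 * n)

# Same boundaries as in the np.split call: [0, 60%), [60%, 80%), [80%, 100%)
train_part = shuffled.iloc[:train_end]
validate_part = shuffled.iloc[train_end:validate_end]
test_part = shuffled.iloc[validate_end:]

print(len(train_part), len(validate_part), len(test_part))  # 6 2 2
```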

Now that we have our input ready, we can start building our neural network. We already decided on the model (LSTM). In Keras, we can simply stack multiple layers on top of each other; for this, we need to initialize the model as _Sequential()_.

## Choosing the right number of nodes and layers

There is no final, definite rule of thumb for how many nodes (or hidden neurons) or how many layers you should choose, and very often a trial-and-error approach will give you the best results for your individual problem. The most common framework for this is most likely [k-fold cross-validation](http://en.wikipedia.org/wiki/Cross-validation_%28statistics%29#K-fold_cross-validation). However, even for such a testing procedure, we need to choose some candidate numbers of nodes.
The following [formula](https://stats.stackexchange.com/a/136542) can give you a starting point:
$$N_h = \frac{N_s} {(\alpha * (N_i + N_o))}$$
$N_i$ is the number of input neurons, $N_o$ the number of output neurons, $N_s$ the number of samples in the training data, and $\alpha$ a scaling factor that is usually between 2 and 10. Evaluating the formula for several values of $\alpha$ gives us a set of candidate numbers to feed into our validation procedure, where we pick the model with the lowest validation loss.
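
As a rough illustration, the formula can be evaluated for every integer $\alpha$ from 2 to 10. The sketch below assumes the sample count reported in the training log further down (299779) and, like the article's own code, treats the flattened 23x30 input as $N_i$; the function name `hidden_node_candidates` is made up for this example.

```python
n_samples = 299779   # training samples, as reported in the training log further down
n_inputs = 23 * 30   # word_vec_length * char_vec_length
n_outputs = 2        # male / female


def hidden_node_candidates(alphas):
    """Evaluate N_h = N_s / (alpha * (N_i + N_o)) for each alpha, rounded down."""
    return {alpha: int(n_samples / (alpha * (n_inputs + n_outputs))) for alpha in alphas}


candidates = hidden_node_candidates(range(2, 11))
for alpha, n_hidden in candidates.items():
    print(f"alpha={alpha:2d} -> {n_hidden} hidden nodes")
```

Each of these candidates could then be trained for a few epochs and compared on the validation loss.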

If the problem is simple and time is an issue, there are various other rules of thumb for determining the number of nodes, most of which are simply based on the number of input and output neurons. We have to keep in mind that, while easy to use, they will rarely yield the optimal result. Here is just one example, which we will use for this basic model:
$$N_h = \frac{2} {3} *(N_i + N_o)$$

In [60]:
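# N_i is the flattened input size; the two output neurons (N_o) are left out here,
# which changes the result by one node (460 instead of 461)
hidden_nodes = int(2/3 * (word_vec_length * char_vec_length))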
print(f"The number of hidden nodes is {hidden_nodes}.")

The number of hidden nodes is 460.


As mentioned, the same uncertainty also exists for the number of hidden layers to use. Again, the ideal number will differ for any given use case and is best determined by running different models against each other. More layers can be better but are also harder to train. As a general rule of thumb, one hidden layer works for simple problems like this one, and two are enough to find reasonably complex features. <br>
In our case, adding a second layer only improves the accuracy by about 0.1 percentage points (0.9807 vs. 0.9819) after 10 epochs.

### Choosing additional Hyper-Parameters

Every LSTM layer should be accompanied by a Dropout layer. This layer helps to prevent overfitting by ignoring randomly selected neurons during training, which reduces the sensitivity to the specific weights of individual neurons. A dropout rate of 20% is often used as a good compromise between retaining model accuracy and preventing overfitting.
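
What a dropout layer does during training can be sketched in plain NumPy. This is an illustration of the idea (so-called inverted dropout, where the surviving units are scaled up so that the expected activation stays the same), not the Keras implementation itself; the function and variable names are made up.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed chosen only to make the sketch reproducible


def dropout(activations, rate, training=True):
    """Inverted dropout: zero a random fraction `rate` of units and rescale the rest."""
    if not training:
        return activations  # at inference time the values pass through unchanged
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)


activations = np.ones((1000, 460))  # hypothetical batch of LSTM outputs
dropped = dropout(activations, rate=0.2)

print(f"fraction zeroed: {np.mean(dropped == 0):.3f}")  # close to 0.2
print(f"mean activation: {dropped.mean():.3f}")         # close to 1.0 thanks to rescaling
```

Because a different random subset is dropped in every batch, the network cannot rely on any single neuron.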

After our LSTM layer(s) have done all the work of transforming the input so that predictions towards the desired output become possible, we have to reduce (or, in rare cases, extend) the shape to match our desired output. This is the job of a dense (fully connected) layer. In our case, we have two output labels and therefore need two output units.

The final layer to add is the activation layer. Technically, this can be included in the dense layer, but there is a reason to split them apart. While not relevant here, splitting the dense layer and the activation layer makes it possible to retrieve the reduced, not yet activated output of the dense layer of the model. Which activation function to use depends, again, on the application. For our problem at hand, we have multiple classes (male and female), but only one of the classes can be present at a time. For these types of problems, the [softmax activation function](https://en.wikipedia.org/wiki/Softmax_function) generally works best, because it allows us (and the model) to interpret the outputs as probabilities.
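
To make the probability interpretation concrete, here is a small NumPy sketch of the softmax function applied to made-up outputs of the two-unit layer. The column order (male, female) follows the label encoding used above; the numbers are purely hypothetical.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # avoids overflow in exp
    exponentials = np.exp(shifted)
    return exponentials / exponentials.sum(axis=-1, keepdims=True)


# Hypothetical raw outputs of the two-unit layer for three names (columns: male, female)
dense_output = np.array([[2.0, -1.0],
                         [0.1, 0.3],
                         [-3.0, 4.0]])

probabilities = softmax(dense_output)
print(probabilities.round(3))
print(probabilities.sum(axis=1))  # each row sums to 1
```

The larger the gap between the two raw outputs, the closer the winning probability gets to 1.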

Loss function and activation function are often chosen together. Using the softmax activation function points us to cross-entropy as our preferred loss function. Since our labels are one-hot encoded over two output units, we use the categorical cross-entropy; for two classes, it is equivalent to the binary cross-entropy one would use with a single sigmoid output unit. Those two functions work well with each other because the cross-entropy function cancels out the plateaus at each end of the softmax function and therefore speeds up the learning process.
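
The model code further down uses `categorical_crossentropy` with two softmax units. For a two-class problem, this gives the same loss value as binary cross-entropy on a single sigmoid unit that sees the difference between the two raw outputs. The following NumPy sketch checks this numerically on made-up values; both functions are simplified illustrations, not the Keras implementations.

```python
import numpy as np


def categorical_crossentropy(one_hot_label, logits):
    """Cross-entropy between a one-hot label and softmax(logits)."""
    shifted = logits - logits.max()
    log_probabilities = shifted - np.log(np.exp(shifted).sum())
    return -np.sum(one_hot_label * log_probabilities)


def binary_crossentropy(label, logit):
    """Cross-entropy between a 0/1 label and sigmoid(logit)."""
    probability = 1.0 / (1.0 + np.exp(-logit))
    return -(label * np.log(probability) + (1 - label) * np.log(1 - probability))


logits = np.array([1.5, -0.5])  # hypothetical raw outputs (male, female)
label = np.array([1.0, 0.0])    # one-hot label for 'm'

two_unit_loss = categorical_crossentropy(label, logits)
# A single sigmoid unit sees the difference between the two logits
one_unit_loss = binary_crossentropy(label[0], logits[0] - logits[1])

print(two_unit_loss, one_unit_loss)  # the two values agree up to rounding
```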

For the optimizer, adaptive moment estimation, or _Adam_ for short, has been shown to work well in most practical applications and needs only little tuning of its hyperparameters. Last but not least, we have to decide by which metric we want to judge our model. Keras offers [multiple accuracy functions](https://keras.io/metrics/). In many cases, judging the model's performance by its overall _accuracy_ will be the option that is easiest to interpret and is sufficient for assessing the resulting model.
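
For one-hot labels, overall accuracy boils down to comparing the position of the largest predicted probability with the position of the 1 in the label. A minimal NumPy sketch with made-up labels and predictions (the function name `accuracy` is chosen for this example only):

```python
import numpy as np


def accuracy(one_hot_labels, predicted_probabilities):
    """Share of samples whose most probable class matches the one-hot label."""
    true_classes = np.argmax(one_hot_labels, axis=1)
    predicted_classes = np.argmax(predicted_probabilities, axis=1)
    return float(np.mean(true_classes == predicted_classes))


# Hypothetical labels and model outputs for four names (columns: male, female)
labels = np.array([[1, 0], [0, 1], [0, 1], [1, 0]])
predictions = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.7, 0.3]])

print(accuracy(labels, predictions))  # 0.75
```

Here the third name is misclassified, so three out of four predictions are correct.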

## Building, training and evaluating the model
After getting some intuition about how to choose the most important parameters, let's put them all together and train our model:

In [65]:
# Build the model
print('Build model...')
model = Sequential()
model.add(LSTM(hidden_nodes, return_sequences=False, input_shape=(word_vec_length, char_vec_length)))
model.add(Dropout(0.2))
model.add(Dense(units=output_labels))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])

Build model...


In [66]:
batch_size=1000
model.fit(train_x, train_y, batch_size=batch_size, epochs=10, validation_data=(validate_x, validate_y))

Train on 299779 samples, validate on 99926 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x29a023ffb70>

An accuracy of 98.2% is pretty impressive and most likely results from the fact that most names in the validation set were already present in our training set. Using our validation set, we can take a quick look at where our model went wrong:

In [73]:
validate['predicted_gender'] = ['m' if prediction[0] > prediction[1] else 'f' for prediction in model.predict(validate_x)]
validate[validate['gender'] != validate['predicted_gender']].head()

| | gender | first_name | first_firstname | predicted_gender |
|---|---|---|---|---|
| 295484 | f | Tordis | Tordis | m |
| 164216 | m | Kaufman | Kaufman | f |
| 355847 | m | Veli | Veli | f |
| 149883 | f | Seher | Seher | m |
| 178947 | f | Spohn | Spohn | m |


Looking at the results, at least some of the false predictions seem to occur for people who typed their family name into the first name field. Given that, a good next step would be to clean the original dataset of those cases. For now, the result looks pretty promising. With the accuracy we can achieve, this model could already be used in many real-world situations. Training the model for more epochs might also increase its performance; the important thing here is to keep an eye on the performance on the validation set to prevent possible overfitting.
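
How much of the validation accuracy comes from names the model has already seen during training can be estimated with a simple overlap check. The sketch below uses two tiny toy frames in place of the real `train` and `validate` splits; all values are hypothetical.

```python
import pandas as pd

# Toy frames standing in for the `train` and `validate` splits (hypothetical values)
train_names = pd.DataFrame({'first_firstname': ['Anna', 'Ben', 'Clara', 'Anna', 'Emil']})
validate_names = pd.DataFrame({'first_firstname': ['anna', 'Ben', 'Tordis', 'Veli']})

# Compare case-insensitively, since the encoding lowercases every name anyway
seen_in_training = set(train_names['first_firstname'].str.lower())
is_seen = validate_names['first_firstname'].str.lower().isin(seen_in_training)

print(f"share of validation names also in training: {is_seen.mean():.2f}")  # 0.50
print(validate_names[~is_seen])  # names the model has never seen
```

Evaluating the model only on the unseen names would likely give a more honest picture of how well it generalizes.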

## Conclusion
In this article, we have successfully built a small model to predict the gender from a given (German) first name with an accuracy of over 98%. While Keras frees us from writing complex deep learning algorithms, we still have to make choices regarding some of the hyperparameters along the way. In some cases, e.g. choosing the right activation function, we can rely on rules of thumb or can determine the right parameter based on our problem. However, in some other cases, the best result will come from testing various configurations and then evaluating the outcome.