# Deep Learning in Python Project: Gender by Name Prediction

## Lisette Sibbald 13280376, Felice Wulfse 13237071 & Sophie Elting 13316338

This notebook walks the reader through our project on predicting gender from a first name. 
We will use a Long Short-Term Memory recurrent neural network (LSTM RNN) to learn gender from names: the input is the sequence of characters in a name, and the output is a binary label.

This notebook starts with the data and data preparation, followed by model selection, and finally our model and its predictions. 

First we will import relevant packages.

In [1]:
import re

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

# Use the Keras API bundled with TensorFlow throughout; mixing the standalone
# `keras` package with `tensorflow.keras` can lead to incompatible objects.
from tensorflow.keras import callbacks
from tensorflow.keras.layers import LSTM, Activation, Dense, Dropout
from tensorflow.keras.models import Sequential

print("import successful")

import successful


Next we will import the data set, which comes from https://archive.ics.uci.edu/ml/datasets/Gender+by+Name.
Its description states:

"This data set combines raw counts for first/given names of male and female babies in those time periods, and then calculates a probability for a name given the aggregate count. Source datasets are from government authorities: 
-US: Baby Names from Social Security Card Applications - National Data, 1880 to 2019 
-UK: Baby names in England and Wales Statistical bulletins, 2011 to 2018 
-Canada: British Columbia 100 Years of Popular Baby names, 1918 to 2018 
-Australia: Popular Baby Names, Attorney-General's Department, 1944 to 2019"

There are 147269 names in total. For this project we only need the "Name" and "Gender" variables, so we select these two columns from the data set. 

In [2]:
complete = pd.read_csv('name_gender_dataset.csv')

complete.head() # inspect the complete data set

data = complete[["Name", "Gender"]]

data.head() # inspect the filtered dataset

           Name Gender    Count   Probability
0         James      M  5304407  1.451679e-02
1          John      M  5260831  1.439753e-02
2        Robert      M  4970386  1.360266e-02
3       Michael      M  4579950  1.253414e-02
4       William      M  4226608  1.156713e-02
...         ...    ...      ...           ...
147264   Zylenn      M        1  2.736740e-09
147265   Zymeon      M        1  2.736740e-09
147266   Zyndel      M        1  2.736740e-09
147267   Zyshan      M        1  2.736740e-09
147268    Zyton      M        1  2.736740e-09

[147269 rows x 4 columns]


      Name Gender
0    James      M
1     John      M
2   Robert      M
3  Michael      M
4  William      M


Next we inspect the data further by counting the number of male and female names in the data set. 
There are more female than male names (roughly 61% versus 39%), but the imbalance is not severe. We think both classes have enough data for training and validation.

In [3]:
data.groupby('Gender')['Name'].count()

Gender
F    89749
M    57520
Name: Name, dtype: int64

Now we continue with data preparation. Any unusual characters in the names are removed: we keep only lowercase and uppercase letters, letters with accents or umlauts, and the hyphen '-'. 

In [46]:
names = data['Name']
gender = data['Gender']

# keep only letters, accented letters and '-'; delete all other characters
# (raw string, so that '\-' is not treated as an invalid string escape)
names_cleaned = [re.sub(r'[^a-zA-ZÀ-ÿ\-]', '', name) for name in names]

print(names_cleaned[0:20]) # inspect cleaned names

['James', 'John', 'Robert', 'Michael', 'William', 'Mary', 'David', 'Joseph', 'Richard', 'Charles', 'Thomas', 'Christopher', 'Daniel', 'Matthew', 'Elizabeth', 'Patricia', 'Jennifer', 'Anthony', 'George', 'Linda']
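
To see what the character filter keeps and drops, it can be tried on a few made-up strings. This is only an illustrative sketch: the helper `clean_name` and the sample strings are our own and not part of the data set. Note that the range `À-ÿ` covers a block of Latin-1 code points, which also contains the two non-letter symbols `×` and `÷`.

```python
import re

# Same whitelist as in the cell above: ASCII letters, Latin-1 accented
# letters (the range À-ÿ) and the hyphen.
NOT_ALLOWED = re.compile(r"[^a-zA-ZÀ-ÿ\-]")


def clean_name(name):
    """Remove every character that is not on the whitelist (hypothetical helper)."""
    return NOT_ALLOWED.sub("", name)


# Made-up examples; "Jos\u00e9" is José written with an escape so that the
# accented letter is guaranteed to be a single code point.
samples = ["Anne-Marie", "Jos\u00e9", "O'Neil", "Mary 2"]
cleaned = [clean_name(s) for s in samples]
print(cleaned)
```

The apostrophe, the space and the digit are dropped, while the hyphen and the accented letter survive.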


Next we need to transform the names into model input, which in our case means one-hot encoding for the LSTM. For this we need the set of characters that occur in the data set. 
'END' is added to the vocabulary as a padding token: names shorter than the maximum length are filled up with 'END' so that every input has the same length.

In [5]:
vocab = set(' '.join([str(i) for i in names_cleaned])) # get all distinct characters (joining with ' ' also adds the space character)
vocab.add('END') # add 'END' to vocab
len_vocab = len(vocab)

In [6]:
print(vocab)
print("vocab length is ", len_vocab)
print("number of names ", len(data))

{'x', 'J', 'j', 'Z', 'f', 'e', 'L', 'X', 'w', 'a', 'l', 'E', 'END', 'c', 'R', 'd', 'à', 'F', 'o', '-', 'I', 'D', 'm', 'M', 'H', 'Y', 'B', 'O', 'T', 's', 'U', 'v', 'n', ' ', 'r', 'W', 'P', 't', 'C', 'p', 'N', 'q', 'h', 'k', 'V', 'ö', 'g', 'A', 'Q', 'u', 'b', 'G', 'K', 'z', 'y', 'S', 'i'}
vocab length is  57
number of names  147269


So our vocabulary has 57 entries (including the space character and the 'END' token), and we still have 147269 names.

Next we will need to give every character a number as preparation for the one-hot coding.

In [7]:
char_index = dict((c, i) for i, c in enumerate(vocab)) # create a dictionary with the elements of vocab and a number.

In [8]:
print(char_index) # inspect the index with their corresponding characters
len(char_index)

{'x': 0, 'J': 1, 'j': 2, 'Z': 3, 'f': 4, 'e': 5, 'L': 6, 'X': 7, 'w': 8, 'a': 9, 'l': 10, 'E': 11, 'END': 12, 'c': 13, 'R': 14, 'd': 15, 'à': 16, 'F': 17, 'o': 18, '-': 19, 'I': 20, 'D': 21, 'm': 22, 'M': 23, 'H': 24, 'Y': 25, 'B': 26, 'O': 27, 'T': 28, 's': 29, 'U': 30, 'v': 31, 'n': 32, ' ': 33, 'r': 34, 'W': 35, 'P': 36, 't': 37, 'C': 38, 'p': 39, 'N': 40, 'q': 41, 'h': 42, 'k': 43, 'V': 44, 'ö': 45, 'g': 46, 'A': 47, 'Q': 48, 'u': 49, 'b': 50, 'G': 51, 'K': 52, 'z': 53, 'y': 54, 'S': 55, 'i': 56}


57
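
One caveat: a Python `set` of strings has no guaranteed iteration order, so the number assigned to each character can differ between sessions. That matters when a saved model is reloaded later, because the inputs must be encoded with the same mapping. A minimal sketch of a deterministic alternative, assuming a sorted vocabulary is acceptable; `build_char_index` and the toy names are hypothetical and not what the notebook above uses:

```python
def build_char_index(names, pad_token="END"):
    """Map each distinct character to a stable index; the padding token comes last."""
    chars = sorted(set("".join(names)))  # sorted() makes the order reproducible
    chars.append(pad_token)
    return {c: i for i, c in enumerate(chars)}


# Toy example with two made-up names.
toy_index = build_char_index(["Ann", "Bob"])
print(toy_index)
```

Uppercase letters sort before lowercase ones, so 'A' and 'B' get the lowest indices here.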

# Continue here

In [9]:
# train/test split: each name is assigned at random to train (about 80%) or test (about 20%)
msk = np.random.rand(len(data)) < 0.8
train = data[msk]
test = data[~msk]
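
Because the mask above is drawn without a seed, the split (and therefore the exact shapes and scores shown below) changes on every run. A minimal sketch of a seeded variant, assuming NumPy's `default_rng` generator; the function name `split_mask` and the seed value are our own choices for illustration:

```python
import numpy as np


def split_mask(n_rows, train_fraction=0.8, seed=0):
    """Return a boolean mask that is True for rows assigned to the training set."""
    rng = np.random.default_rng(seed)  # fixed seed, so the same mask every run
    return rng.random(n_rows) < train_fraction


mask = split_mask(10_000)
print(mask.mean())  # share of rows in the training set, close to 0.8

# With the notebook's DataFrame this would be used as:
#   train = data[mask]
#   test = data[~mask]
```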

In [10]:
long_name = max(names, key = len)
maxlen = len(long_name)
print(maxlen)

train_X = []

trunc_train_name = [str(i)[0:maxlen] for i in train.Name]

# map every name to a list of character indices, padded with the index of 'END' up to maxlen
for i in trunc_train_name: # i = a name
    tmp = [char_index.get(j, []) for j in str(i)] # j = a single character
    for k in range(0, maxlen - len(str(i))):
        tmp.append(char_index['END'])
    train_X.append(tmp)

train_X

25


[[1, 9, 22, 5, 29, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12],
 [14, 18, 50, 5, 34, 37, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12],
 [23, 56, 13, 42, 9, 5, 10, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12],
 [23, 9, 34, 54, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12],
 [21, 9, 31, 56, 15, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12],
 [28, 42, 18, 22, 9, 29, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12],
 [38, 42, 34, 56, 29, 37, 18, 39, 42, 5, 34, 12, 12, 12, 12, 12, 12,
 ...

In [11]:
np.asarray(train_X).shape

(117710, 25)
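
The index encoding with padding can also be written as one small function, which avoids repeating the loop for the training, test and prediction inputs. This is a sketch under our own assumptions: `encode_indices` is a hypothetical helper, the toy vocabulary is made up, and, unlike the cell above, characters missing from the vocabulary are skipped.

```python
def encode_indices(name, char_index, maxlen, pad_token="END"):
    """Turn a name into exactly `maxlen` character indices, padded with the END index."""
    name = name[:maxlen]  # truncate names that are too long
    ids = [char_index[c] for c in name if c in char_index]  # skip unknown characters
    ids += [char_index[pad_token]] * (maxlen - len(ids))  # pad on the right
    return ids


# Toy vocabulary for illustration only.
toy_index = {"a": 0, "b": 1, "END": 2}
encoded = encode_indices("abba", toy_index, maxlen=6)
print(encoded)
```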

In [12]:
def set_flag(i):
    """Return a one-hot vector of length len_vocab with position i set to 1."""
    tmp = np.zeros(len_vocab)
    tmp[i] = 1
    return tmp

In [13]:
set_flag(3)

array([0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0.])
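
Calling `set_flag` once per character works, but the same result can be obtained for a whole name at once by indexing into an identity matrix. A minimal sketch, assuming the name has already been converted to a list of indices; `one_hot` and the toy values are our own illustration:

```python
import numpy as np


def one_hot(ids, vocab_size):
    """One-hot encode a list of indices; the result has shape (len(ids), vocab_size)."""
    # Row i of the identity matrix is exactly the one-hot vector for index i.
    return np.eye(vocab_size, dtype=np.float32)[ids]


encoded = one_hot([0, 1, 1, 0, 2, 2], vocab_size=3)
print(encoded.shape)
print(encoded[0])
```

For the notebook's data, `vocab_size` would be `len_vocab` (57) and each name would yield a matrix with `maxlen` (25) rows.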

In [14]:
# truncate each name to at most maxlen characters
# pad shorter names with 'END' up to maxlen
# convert each character index to a one-hot vector
# encode the gender as a one-hot pair: 'M' -> [1, 0], otherwise [0, 1]
train_X = []
train_Y = []

trunc_train_name = [str(i)[0:maxlen] for i in train.Name]

for i in trunc_train_name:
    tmp = [set_flag(char_index.get(j, [])) for j in str(i)]
    for k in range(0,maxlen - len(str(i))):
        tmp.append(set_flag(char_index["END"]))
    train_X.append(tmp)
for i in train.Gender:
    if i == 'M':
        train_Y.append([1,0])
    else:
        train_Y.append([0,1])
    

In [15]:
np.asarray(train_X).shape

(117710, 25, 57)

In [16]:
np.asarray(train_Y).shape

(117710, 2)
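
The targets therefore have two columns: column 0 is 1 for male names and column 1 is 1 for all other names. The same encoding can be produced without a loop; the sketch below assumes the labels are the strings 'M' and 'F' as in our data, and `encode_gender` is a hypothetical helper:

```python
import numpy as np


def encode_gender(labels):
    """Encode 'M' as [1, 0] and every other label as [0, 1]."""
    is_male = np.asarray(labels) == "M"
    return np.stack([is_male, ~is_male], axis=1).astype(np.float32)


targets = encode_gender(["M", "F", "M"])
print(targets)
```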

In [17]:
from tensorflow.keras import callbacks

early_stopping = callbacks.EarlyStopping(
    min_delta=0.001, # minimum amount of change to count as an improvement
    patience=5, # how many epochs to wait before stopping
    restore_best_weights=True,
)
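
By default `EarlyStopping` monitors the validation loss (`val_loss`). To make the roles of `min_delta` and `patience` concrete, the stopping rule can be imitated in plain Python. This is a simplified model of the logic written for illustration, not the Keras implementation, and the loss values are made up:

```python
def stopping_epoch(val_losses, min_delta=0.001, patience=5):
    """Return the 0-based epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    wait = 0  # number of epochs since the last real improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:  # improved by more than min_delta
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Made-up validation losses: one improvement, then a plateau.
losses = [1.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
stop_at = stopping_epoch(losses)
print(stop_at)
```

With `restore_best_weights=True`, the model would then go back to the weights from the best epoch rather than keep those from the last one.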

In [18]:
# build the model: two stacked LSTM layers with dropout, followed by a 2-unit softmax output
print('Build model...')
model = Sequential()
model.add(LSTM(512, return_sequences=True, input_shape=(maxlen,len_vocab)))
model.add(Dropout(0.2))
model.add(LSTM(512, return_sequences=False))
model.add(Dropout(0.2))
model.add(Dense(2))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam',metrics=['accuracy'])

Build model...
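
To get a feeling for the size of this network, the number of trainable parameters can be worked out by hand. The sketch below uses the standard parameter count for an LSTM layer with bias, 4 * units * (input_dim + units + 1), and assumes the values from this notebook (57 vocabulary entries, 512 units per layer); the function names are our own:

```python
def lstm_params(input_dim, units):
    """Parameters of a standard LSTM layer: 4 gates, each with input, recurrent and bias weights."""
    return 4 * units * (input_dim + units + 1)


def dense_params(input_dim, units):
    """Parameters of a fully connected layer: weights plus biases."""
    return input_dim * units + units


first_lstm = lstm_params(57, 512)    # input is a one-hot vector of length 57
second_lstm = lstm_params(512, 512)  # input is the output of the first LSTM
output = dense_params(512, 2)        # two output classes
total = first_lstm + second_lstm + output
print(first_lstm, second_lstm, output, total)
```

The dropout and activation layers add no parameters, so the total should match the figure reported by `model.summary()`.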



In [19]:
test_X = []
test_Y = []
trunc_test_name = [str(i)[0:maxlen] for i in test.Name]
for i in trunc_test_name:
    tmp = [set_flag(char_index.get(j, [])) for j in str(i)]
    for k in range(0,maxlen - len(str(i))):
        tmp.append(set_flag(char_index["END"]))
    test_X.append(tmp)
for i in test.Gender:
    if i == 'M':
        test_Y.append([1,0])
    else:
        test_Y.append([0,1])

In [20]:
print(np.asarray(test_X).shape)
print(np.asarray(test_Y).shape)

(29559, 25, 57)
(29559, 2)


In [21]:
train_X = np.asarray(train_X)
train_Y = np.asarray(train_Y)
test_X = np.asarray(test_X)
test_Y = np.asarray(test_Y)

In [30]:
batch_size=1000
history = model.fit(train_X, train_Y, batch_size=batch_size, epochs=50, validation_data = (test_X, test_Y), callbacks=[early_stopping])

Epoch 1/50


KeyboardInterrupt: 

In [23]:
score, acc = model.evaluate(test_X, test_Y)
print('Test score:', score)
print('Test accuracy:', acc)

Test score: 0.4113174080848694
Test accuracy: 0.8153185248374939
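
To put this accuracy in context, it helps to compare it with a classifier that always predicts the most frequent class. The sketch below uses the class counts of the full data set reported earlier; the test split will have slightly different proportions, so the baseline is approximate:

```python
# Class counts taken from the groupby output earlier in this notebook.
counts = {"F": 89749, "M": 57520}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total
print(majority_label, round(baseline_accuracy, 3))
```

Always predicting 'F' would be right for about 61% of the names, so the test accuracy above is a clear improvement over this baseline.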


In [41]:
# encode a new name in the same way as the training data, then predict
name = ["Sander"]
X=[]
trunc_name = [i[0:maxlen] for i in name]
for i in trunc_name:
    tmp = [set_flag(char_index.get(j, [])) for j in str(i)]
    for k in range(0,maxlen - len(str(i))):
        tmp.append(set_flag(char_index["END"]))
    X.append(tmp)
pred=model.predict(np.asarray(X))

In [42]:
pred

array([[0.7477861 , 0.25221384]], dtype=float32)
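
The two numbers follow the order of the target encoding used for training: the first column is the probability of 'M', the second the probability of 'F'. A small sketch of how such an output could be turned into a readable label; `decode_prediction` is a hypothetical helper, and the probabilities are the ones shown above:

```python
import numpy as np

LABELS = ["M", "F"]  # column order of the one-hot targets used for training


def decode_prediction(probs):
    """Return (label, probability) of the most likely class for each row."""
    probs = np.asarray(probs)
    best = probs.argmax(axis=1)
    return [(LABELS[i], float(row[i])) for row, i in zip(probs, best)]


decoded = decode_prediction([[0.7477861, 0.25221384]])
print(decoded)
```

So for "Sander" the model predicts male with a probability of about 0.75.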

In [25]:
model.save('mod_callback.h5')
model.save_weights('callback_weights.h5')

In [29]:
history.history.keys() # list the metrics that were recorded during training

In [None]:

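# the model was compiled with metrics=['accuracy'], so the keys are 'accuracy' and 'val_accuracy'
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])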
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc='upper left')
plt.show()