# <a name="0">Gender Prediction</a>
In this notebook, we predict a person's gender from their first name using a character-level neural network.

The notebook follows these steps:

1. <a href="#1">Read Dataset</a>
2. <a href="#2">Feature Extraction</a>
3. <a href="#3">Train Model</a>
4. <a href="#4">Save Model</a>

## 1. <a name="1">Read Dataset</a>
(<a href="#0">Go to top</a>)

In [1]:
import numpy as np
import pandas as pd

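# The CSV has no header row, so supply the column names explicitly
df = pd.read_csv('name_gender.csv', index_col=None, header=None, names=['name', 'gender'])

# Lowercase every name so the character vocabulary is case-insensitive
df['name'] = df['name'].str.lower()

# Preview the first 5 rows, then check the shape
# (only the last expression in a cell, df.shape, is displayed)
df.head(5)
df.shape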

(95025, 2)

In [2]:
# Look at the distribution of the gender field
df['gender'].value_counts()

# Check the number of missing values in each column
df.isna().sum()

# Check how many unique names there are
df['name'].nunique()

95025

## 2. <a name="2">Feature Extraction</a>
(<a href="#0">Go to top</a>)

In [3]:
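from sklearn.preprocessing import LabelEncoder

# Encode the gender strings as integer class ids (classes are sorted, so ids follow alphabetical order)
le = LabelEncoder()
label = le.fit_transform(df['gender'])

name = list(df['name'])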

In [4]:
from keras.preprocessing.text import Tokenizer
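# Build a character-level vocabulary from the names, then map each name to a sequence of integer ids
tokenizer = Tokenizer(char_level=True)
tokenizer.fit_on_texts(name)
sequence_of_int = tokenizer.texts_to_sequences(name)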

In [5]:
from keras.preprocessing.sequence import pad_sequences
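# Zero-pad each sequence at the end to a fixed length of 15
# (names longer than 15 characters are truncated from the front by default)
padsequences = pad_sequences(sequence_of_int, maxlen=15, padding='post')
padsequences.shape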

(95025, 15)
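
To make the two preprocessing steps above concrete, here is a minimal pure-Python sketch of character-level indexing followed by post-padding. It is a simplified stand-in, not the Keras implementation: the helper names `build_char_index` and `encode_and_pad` and the toy names are hypothetical, and the only behaviours mirrored are frequency-ordered ids starting at 1, zero-padding at the end, and truncation from the front.

```python
from collections import Counter

def build_char_index(names):
    """Map each character to an integer id, most frequent first, starting at 1."""
    counts = Counter(ch for name in names for ch in name)
    # Id 0 is reserved for padding, so real characters start at 1.
    ordered = sorted(counts, key=lambda ch: -counts[ch])
    return {ch: i for i, ch in enumerate(ordered, start=1)}

def encode_and_pad(name, char_index, maxlen=15):
    """Encode a name as character ids, then zero-pad at the end to maxlen."""
    # Characters missing from the index are skipped.
    ids = [char_index[ch] for ch in name if ch in char_index]
    # Keep only the last maxlen ids, so over-long names lose leading characters.
    ids = ids[-maxlen:]
    return ids + [0] * (maxlen - len(ids))

toy_names = ["anna", "bob"]  # toy data, not taken from name_gender.csv
toy_index = build_char_index(toy_names)
toy_encoded = [encode_and_pad(n, toy_index, maxlen=6) for n in toy_names]
print(toy_index)
print(toy_encoded)
```

Every row has the same length, which is what lets the padded sequences be stacked into a single `(95025, 15)` array.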

In [6]:
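# keras.utils.np_utils is not available in newer Keras releases; import from keras.utils instead
from keras.utils import to_categorical

# One-hot encode the integer class ids
label = to_categorical(label)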
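
The label pipeline (integer encoding, then one-hot encoding) can be sketched in plain Python as follows. This is an illustration under assumptions: `label_encode` and `one_hot` are hypothetical helpers, and the `"M"`/`"F"` strings are placeholders, since the actual values in the `gender` column are not shown in this notebook.

```python
def label_encode(values):
    """Assign integer ids to the sorted unique class names."""
    classes = sorted(set(values))
    class_to_id = {c: i for i, c in enumerate(classes)}
    return classes, [class_to_id[v] for v in values]

def one_hot(ids, num_classes):
    """Turn each integer id into a row with a single 1.0 at that position."""
    return [[1.0 if j == i else 0.0 for j in range(num_classes)] for i in ids]

toy_genders = ["M", "F", "F", "M"]  # hypothetical label values
toy_classes, toy_ids = label_encode(toy_genders)
toy_onehot = one_hot(toy_ids, num_classes=len(toy_classes))
print(toy_classes, toy_ids)
print(toy_onehot)
```

With two classes, each label becomes a row of length 2, which matches the two output units of the model defined later.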

In [7]:
from sklearn.model_selection import train_test_split
 
# Hold out 30% of the data for testing; the fixed seed makes the split reproducible
dfX_train, dfX_test, dfy_train, dfy_test = train_test_split(
    padsequences,
    label,
    test_size=0.30,
    shuffle=True,
    random_state=123,
)
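
As a quick sanity check, the expected split sizes can be worked out by hand. The sketch below assumes the test set size is rounded up and the training set takes the remainder; the row count comes from `df.shape` above.

```python
import math

n_samples = 95025   # rows reported by df.shape
test_size = 0.30

# Assumption: the test size is rounded up, the training set gets the rest.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)
```

The test count computed this way agrees with the total support of 28508 shown in the classification report in the next section.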

## 3. <a name="3">Train Model</a>
(<a href="#0">Go to top</a>)

In [8]:
from keras.models import Sequential
from keras.layers import Dense,Conv1D,MaxPooling1D,LSTM,Embedding,Dropout

In [9]:
# CNN + LSTM for sequence classification
model = Sequential()
# Vocabulary size 27 assumes 26 lowercase letters plus id 0 for padding
model.add(Embedding(27, 64, input_length=15))
model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(LSTM(256))
model.add(Dropout(0.2))
# Two output units, one per class, matching the one-hot labels
model.add(Dense(2, activation='sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, 15, 64)            1728      
_________________________________________________________________
conv1d (Conv1D)              (None, 15, 32)            6176      
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 7, 32)             0         
_________________________________________________________________
lstm (LSTM)                  (None, 256)               295936    
_________________________________________________________________
dropout (Dropout)            (None, 256)               0         
_________________________________________________________________
dense (Dense)                (None, 2)                 514       
=================================================================
Total params: 304,354
Trainable params: 304,354
Non-trainable params: 0
_________________________________________________________________
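
The parameter counts in the summary can be reproduced by hand from the layer sizes. The sketch below uses the standard formulas for each layer type; the variable names are illustrative only.

```python
vocab_size, embed_dim = 27, 64
conv_filters, kernel_size = 32, 3
lstm_units = 256
num_classes = 2

# Embedding: one vector of length embed_dim per vocabulary entry
embedding_params = vocab_size * embed_dim
# Conv1D: kernel_size * input_channels weights per filter, plus one bias per filter
conv_params = kernel_size * embed_dim * conv_filters + conv_filters
# LSTM: 4 gates, each with input weights, recurrent weights, and a bias
lstm_params = 4 * ((conv_filters + lstm_units) * lstm_units + lstm_units)
# Dense: one weight per input per output unit, plus one bias per output unit
dense_params = lstm_units * num_classes + num_classes

total_params = embedding_params + conv_params + lstm_params + dense_params
print(embedding_params, conv_params, lstm_params, dense_params, total_params)
```

Pooling and dropout layers have no trainable weights, which is why they show 0 in the summary. The LSTM layer accounts for most of the parameters.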

In [10]:
epochs = 10
batch_size = 1000

model.fit(dfX_train,
          dfy_train,
          epochs=epochs,
          validation_data=(dfX_test,dfy_test),
          batch_size=batch_size)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x268e6652820>

In [11]:
# Evaluate the model on the held-out test set
import sklearn.metrics as m

# Convert predicted scores and one-hot labels back to integer class ids
label_pred = model.predict(dfX_test)
label_pred = np.argmax(label_pred, axis=1)
label_test = np.argmax(dfy_test, axis=1)

print("Accuracy of model: ", m.accuracy_score(label_test, label_pred)*100, "%.")
print(m.classification_report(label_test, label_pred))

Accuracy of model:  86.47397221832468 %.
              precision    recall  f1-score   support

           0       0.92      0.86      0.89     18131
           1       0.78      0.87      0.82     10377

    accuracy                           0.86     28508
   macro avg       0.85      0.87      0.86     28508
weighted avg       0.87      0.86      0.87     28508
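
The numbers in the report are related to each other, which gives a simple consistency check. Overall accuracy equals the support-weighted average of the per-class recalls, and the macro average is the unweighted mean across classes. The sketch below uses only the rounded values printed above, so the result is approximate.

```python
# Per-class values copied from the classification report (rounded to 2 decimals there)
precision = {0: 0.92, 1: 0.78}
recall = {0: 0.86, 1: 0.87}
support = {0: 18131, 1: 10377}

total_support = sum(support.values())

# Accuracy = correctly classified samples / all samples
#          = sum over classes of (recall * support) / total support
approx_accuracy = sum(recall[c] * support[c] for c in support) / total_support

# Macro average: plain mean over classes, ignoring class sizes
macro_precision = sum(precision.values()) / len(precision)

print(total_support, approx_accuracy, macro_precision)
```

Because class 0 has nearly twice the support of class 1, the weighted averages sit closer to the class 0 scores than the macro averages do.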



## 4. <a name="4">Save Model</a>
(<a href="#0">Go to top</a>)

In [12]:
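# Save the trained model in HDF5 format
model.save('model.h5')

# Save the fitted tokenizer so the same character-to-id mapping can be reused at inference time
import pickle
with open('tokenizer.pkl', 'wb') as f:
    pickle.dump(tokenizer, f)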
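
The tokenizer is saved alongside the model because a new name must be encoded with exactly the same character-to-id mapping that was used in training. The sketch below illustrates that round trip with a plain dictionary as a hypothetical stand-in for the fitted tokenizer; `encode_name` and the sample mapping are assumptions for illustration, not part of the notebook.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the fitted tokenizer's character index
char_index = {"a": 1, "n": 2, "e": 3}

# Write the mapping to disk and read it back, closing the file handles properly
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tokenizer.pkl")
    with open(path, "wb") as f:
        pickle.dump(char_index, f)
    with open(path, "rb") as f:
        restored_index = pickle.load(f)

def encode_name(name, index, maxlen=15):
    """Lowercase, map characters to ids, and zero-pad at the end to maxlen."""
    ids = [index[ch] for ch in name.lower() if ch in index][-maxlen:]
    return ids + [0] * (maxlen - len(ids))

encoded_row = encode_name("Anne", restored_index)
print(encoded_row)
```

With the real artifacts, one would presumably load `tokenizer.pkl` with `pickle.load` and `model.h5` with `keras.models.load_model`, apply the same lowercasing and padding, and pass the result to `model.predict`.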