# Guessing people's natural gender by their name

In Portuguese (the official language of Brazil), people's names are usually tied to their natural gender (male or female only). That being said, a person's name alone contains patterns that can reveal their natural gender.

This notebook shows one approach using Machine Learning to discover people's natural gender using only their name.

Note: the dataset contains only a list of common names used in Brazil.

# Step 1: Reading Data

The first step of this process is to read the data.
The data consists of two columns containing the person's name and gender.

The gender field contains values `Masc` for male and `Fem` for female.

In [None]:
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import unicodedata

data = pd.read_csv('../input/brazilian-names.csv', header=None) \
    .rename(columns={0 : 'name', 1 : 'gender'})

data.head()

# Step 2: Data Preparation

This section covers how the data (plain names) was prepared for training a classification model.

It is split into 3 subsections:
* Name normalization;
* Numeric representation;
* Interpolating the data to a fixed length.

## Step 2.1: Name Normalization

This step defines a function `norm_name` that normalizes names. The normalization takes the following steps:
* Remove accents (ã -> a);
* Lowercase the names;
* Remove duplicated letters (aa -> a, nn -> n).

After the function is defined, it is applied to the original dataset. The gender field is also reduced to its first letter (`M` or `F`).

In [None]:
import string
import itertools

def norm_name(data):
    unaccents = ''.join(x for x in unicodedata.normalize('NFKD', data) if x in string.ascii_letters).lower()
    return ''.join(ch for ch, _ in itertools.groupby(unaccents))

data['name'] = data['name'].map(norm_name)
data['gender'] = data['gender'].map(lambda r: r[0])
data = data.drop_duplicates()  # drop_duplicates returns a new frame, so keep the result
data.head()
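
As a quick sanity check, the sketch below re-implements the same three normalization steps in a standalone function and applies it to a few names. The function name `normalize_name` and the sample names are illustrative assumptions, not part of the dataset.

```python
import itertools
import string
import unicodedata

def normalize_name(raw):
    # NFKD splits an accented letter into base letter + combining mark;
    # keeping only ASCII letters then drops the marks (and spaces, digits, ...).
    ascii_only = ''.join(ch for ch in unicodedata.normalize('NFKD', raw)
                         if ch in string.ascii_letters).lower()
    # groupby collapses each run of identical letters into a single letter.
    return ''.join(ch for ch, _ in itertools.groupby(ascii_only))

examples = ['João', 'Anna', 'Desirée', 'Conceição']  # hypothetical inputs
normalized = [normalize_name(n) for n in examples]
print(normalized)  # ['joao', 'ana', 'desire', 'conceicao']
```

Note that collapsing repeated letters also shortens names such as `Desirée` to `desire`, so distinct spellings can end up with the same normalized form.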

## Step 2.2: Numeric representation

This step defines how to represent name strings numerically.

The main idea is to convert each character of the string into its numeric code, which maps `'a'..'z'` to `97..122`.

As part of the data preparation, this numeric representation should be scaled to numbers between 0 and 1, so the function linearly rescales `'a'..'z'` to `0..1`.

Applied to the name `albatroz`, the function should return `0.0, 0.44, 0.04, 0.0, 0.76, 0.68, 0.56, 1.0`.

In [None]:
def numeric_rep(name):
    # Map each lowercase letter to [0, 1]: 'a' -> 0.0, 'z' -> 1.0
    return [float(ord(c) - ord('a')) / (ord('z') - ord('a')) for c in name]

numeric_converter = lambda row: ', '.join([str(i) for i in numeric_rep(row)])
# DataFrame.from_items was removed in pandas 1.0; a dict keeps the column order
numeric_df = pd.DataFrame({'name': data['name'],
                           'numeric_name': data['name'].map(numeric_converter)})
numeric_df.head()
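
The `albatroz` example can be checked by hand with a standalone version of the same mapping. This is a minimal sketch: `to_unit_interval` is a hypothetical name, and it assumes the input contains only lowercase `a`..`z`, which is what the normalization step produces.

```python
def to_unit_interval(name):
    # 'a' -> 0.0 and 'z' -> 1.0, with the other letters evenly spaced between.
    span = ord('z') - ord('a')  # 25 steps between 'a' and 'z'
    return [(ord(c) - ord('a')) / span for c in name]

values = [round(v, 2) for v in to_unit_interval('albatroz')]
print(values)  # [0.0, 0.44, 0.04, 0.0, 0.76, 0.68, 0.56, 1.0]
```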

## Step 2.3: Interpolate the data onto a fixed length

This is the most complex step of the data preparation section. All strings now have a normalized (linearly rescaled) numeric representation, but one problem remains: these representations have different lengths.

For instance, `abel (0.0, 0.04, 0.16, 0.44)` has a numeric representation of 4 numbers, while `abelardo (0.0, 0.04, 0.16, 0.44, 0.0, 0.68, 0.12, 0.56)` has a numeric representation of 8 numbers.

The `interpole_name` function rescales the numeric representation to a given length (the `flen` parameter) using linear interpolation.

In [None]:
import numpy as np

def interpole_name(name, flen = 10):
    y = numeric_rep(name)
    x = range(len(y))
    nx = np.linspace(0, len(y)-1, flen)
    return np.interp(nx, x, y)

So `abel` can be expanded to size 10: the representation `(0.0, 0.04, 0.16, 0.44)` becomes approximately `(0.0, 0.013, 0.027, 0.04, 0.08, 0.12, 0.16, 0.253, 0.347, 0.44)`.

In the same way, `abelardo` can be shrunk to size 5: the representation `(0.0, 0.04, 0.16, 0.44, 0.0, 0.68, 0.12, 0.56)` becomes `(0.0, 0.13, 0.22, 0.54, 0.56)`.
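
To make the arithmetic behind these two examples explicit, here is a dependency-free sketch of the same linear interpolation. It is not the notebook's `interpole_name` (which relies on `np.interp`); the helper names `letter_values` and `resample` are assumptions introduced only for illustration.

```python
def letter_values(name):
    # Same scaling as the numeric representation: 'a' -> 0.0, 'z' -> 1.0.
    return [(ord(c) - ord('a')) / 25 for c in name]

def resample(values, size):
    # Evaluate the piecewise-linear curve through `values` at `size`
    # evenly spaced positions between the first and the last index.
    if len(values) == 1 or size == 1:
        return [values[0]] * size
    last = len(values) - 1
    out = []
    for k in range(size):
        pos = k * last / (size - 1)   # fractional index into `values`
        lo = min(int(pos), last - 1)  # index of the left neighbour
        frac = pos - lo               # distance from the left neighbour
        out.append(values[lo] + frac * (values[lo + 1] - values[lo]))
    return out

expanded = [round(v, 3) for v in resample(letter_values('abel'), 10)]
shrunk = [round(v, 2) for v in resample(letter_values('abelardo'), 5)]
print(expanded)
print(shrunk)
```

The expanded list keeps every original value of `abel` (at positions 0, 3, 6 and 9), whereas the shrunk list for `abelardo` only retains the first and last original values; the rest are blends of neighbouring letters.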

The plots below show both names in their original form and after being expanded to size 50.

In [None]:
import matplotlib.pyplot as plt

f, ((ax1, ax2), (ax3, ax4)) = plt.subplots(ncols=2, nrows=2, figsize = (14, 6))
ax1.set_title('abel')
ax1.stem(range(len('abel')), numeric_rep('abel'))
ax2.set_title('abelardo')
ax2.stem(range(len('abelardo')), numeric_rep('abelardo'))

ax3.set_title('abel (expanded to 50)')
ax3.stem(range(50), interpole_name('abel', 50))
ax4.set_title('abelardo (expanded to 50)')
ax4.stem(range(50), interpole_name('abelardo', 50))

f.subplots_adjust(hspace=0.3)
plt.show()

As shown in the plots above, expanding the numeric representation preserves the original pattern well; shrinking it, however, can cause a significant loss of information. An alternative is to keep only the last N numbers in the list (that is, crop the name to its last N characters). Since Portuguese names seem to signal the person's gender in their last characters, this appears to be a good approach.
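
The cropping alternative amounts to a negative slice. The sketch below uses a hypothetical helper `crop_tail` and a few sample names to show that long names lose their leading characters while short names pass through unchanged (and would still need to be expanded afterwards).

```python
def crop_tail(name, n):
    # A negative slice start keeps at most the last n characters;
    # names shorter than n are returned unchanged.
    return name[-n:]

cropped = {name: crop_tail(name, 5) for name in ['abel', 'abelardo', 'mariana']}
print(cropped)  # {'abel': 'abel', 'abelardo': 'lardo', 'mariana': 'riana'}
```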

With the 3 previous steps defined, they are used to convert the original dataset into a transformed dataframe. The new dataframe contains each name cropped to its last 10 characters (when longer) and expanded to a length of 10 (when shorter), plus the gender (now named `label`) encoded as 0 (for male) or 1 (for female).

The `convertdf` function does this whole job.

In [None]:
def convertdf(data, length = 10):
    dt = data.copy()
    aux = dt['name'].apply(lambda x: pd.Series(interpole_name(x[-length:], length)))
    for i in range(length):
        dt['x'+str(i+1)] = aux[i]
    
    dt['label'] = dt['gender'].map(lambda r: 0 if r == 'M' else 1)
    return dt.drop(columns='gender')  # positional axis argument is not accepted by recent pandas versions
    
tdf = convertdf(data, 10)
tdf.head()

# Step 3: Model Selection

Now it is time to train the model that will do the classification. scikit-learn provides a set of algorithms that can be used, but which one is most appropriate for the analyzed dataset?

To help with algorithm selection, scikit-learn provides the `cross_val_score` function. In this analysis, it is used together with the `KFold` class, so the data is divided into multiple train/test splits and the average accuracy is then calculated.

This section shows the code used to compare the accuracy of the following algorithms: `LogisticRegression`, `LinearDiscriminantAnalysis`, `KNeighborsClassifier`, `DecisionTreeClassifier`, `GaussianNB`, `SVC`, `RandomForestClassifier` and `XGBClassifier`.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from xgboost import XGBClassifier

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

models = [('LR', LogisticRegression()),
         ('LDA', LinearDiscriminantAnalysis()),
         ('KNN', KNeighborsClassifier()),
         ('CART', DecisionTreeClassifier()),
         ('NB', GaussianNB()),
         ('SVM', SVC()),
         ('RF', RandomForestClassifier()),
         ('XGB', XGBClassifier())]

seed = 1073
results = []
names = []
scoring = 'accuracy'
X = tdf.iloc[:,1:11]
Y = tdf['label']
for name, model in models:
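    # recent scikit-learn versions raise an error if random_state is set without shuffle=True
    kfold = KFold(n_splits=10, shuffle=True, random_state=seed)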
    cv_results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

The output of the comparison lists each algorithm with the mean (higher is better) and standard deviation (lower is better) of the accuracy over all the folds evaluated by `cross_val_score`.
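
To see how a single mean/standard deviation pair is produced, the following sketch runs the same `KFold` plus `cross_val_score` combination on a small synthetic dataset. The data (`X_demo`, `y_demo`) is randomly generated for illustration and has nothing to do with the names dataset, so the printed numbers are not comparable with the results above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical data: 200 samples, 10 features in [0, 1];
# the label depends only on the last feature.
rng = np.random.RandomState(0)
X_demo = rng.rand(200, 10)
y_demo = (X_demo[:, -1] > 0.5).astype(int)

# 10 shuffled folds: each sample is used for testing exactly once.
kfold = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(), X_demo, y_demo,
                         cv=kfold, scoring='accuracy')

# One accuracy per fold, summarized as "mean (std)".
print("%f (%f)" % (scores.mean(), scores.std()))
```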

A visual comparison of the results can be analyzed using a boxplot.

In [None]:
fig = plt.figure()

fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

# Step 4: Hyper-Parameter tuning

The numeric and visual results show that, for this dataset, `SVM` and `LDA` achieved better results than the others. However, this comparison used the algorithms in their "base form", without any extra parameter setup. Choosing the right parameters with the right values can significantly improve their results, but which values should be set for each parameter to tune the algorithm? To answer this question, combinations of parameters must be tested and their results compared. scikit-learn provides hyperparameter tuning tools for this.

With these tools, a set of candidate parameter values is defined and the tuning process searches for the combination with the best possible cross-validation score.

An exhaustive search considers all parameter combinations. Since this approach can take too much time, this analysis makes use of a randomized search over the combinations.
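
The difference in cost between the two strategies can be illustrated without fitting any model. The sketch below uses scikit-learn's `ParameterGrid` and `ParameterSampler` on a deliberately small, hypothetical search space (`small_space`) to compare how many candidates each strategy would evaluate.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Hypothetical, reduced search space (not the one used in this notebook).
small_space = {
    'C':      np.logspace(-3, 2, 6).tolist(),
    'kernel': ['linear', 'rbf'],
    'gamma':  np.logspace(-3, 2, 6).tolist(),
}

# Exhaustive search: every combination, 6 * 2 * 6 candidates.
all_candidates = ParameterGrid(small_space)

# Randomized search: only n_iter combinations drawn from the same space.
sampled = list(ParameterSampler(small_space, n_iter=5, random_state=0))

print(len(all_candidates), len(sampled))  # 72 5
```

Each extra parameter multiplies the size of the exhaustive grid, while the randomized search always evaluates `n_iter` candidates, at the price of possibly missing the best combination.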

In [None]:
from sklearn.model_selection import RandomizedSearchCV

# a set of possible values for many possible parameters for a SVM model.
parameters = {
    'C':            np.logspace(-3, 2, 6),
    'kernel':       ['linear', 'rbf'],                   # not tried: 'precomputed', 'poly', 'sigmoid'
    'degree':       np.arange( 0, 100+0, 1 ).tolist(),
    'gamma':        np.logspace(-3, 2, 6),
    'coef0':        np.arange( 0.0, 10.0+0.0, 0.1 ).tolist(),
    'shrinking':    [True],
    'probability':  [False],
    'tol':          np.arange( 0.001, 0.01+0.001, 0.001 ).tolist(),
    'cache_size':   [2000],
    'class_weight': [None],
    'verbose':      [False],
    'max_iter':     [-1],
    'random_state': [None],
    }

# the `iid` argument was removed from recent scikit-learn releases, so it is not passed here
model = RandomizedSearchCV( n_iter              = 500,
                            estimator           = SVC(),
                            param_distributions = parameters,
                            n_jobs              = 4,
                            refit               = True,
                            cv                  = 5,
                            verbose             = 1,
                            pre_dispatch        = '2*n_jobs'
                            )         # scoring = 'accuracy'
model.fit( X, Y )
print( model.best_estimator_ )
print( model.best_score_ )
print( model.best_params_ )

# Step 5: Training the final model

The output above shows a good set of parameters that can be used to improve on the original `SVC()` from the Model Selection step. These parameters can finally be used to train the model.

In [None]:
clf = SVC(C=10.0, cache_size=2000, class_weight=None, coef0=0.4,
  decision_function_shape='ovr', degree=80, gamma=0.1,
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.01, verbose=False) #0.829180327869

clf.fit(X, Y)

Once the final model is trained, it can be used to predict the gender of new names.

In [None]:
sample_names = ["Messias", "Thiago", "Marcelo", "Renata", "Larissa", 
                "Altino", "Desiree", "Andrew", "Jefferson", "Tatiene"]
guessed_genders = clf.predict([interpole_name(norm_name(x)) for x in sample_names])
genders = ['M' if g == 0 else 'F' for g in guessed_genders]

pd.DataFrame({'names': sample_names, 'guessed_gender': genders})  # from_items was removed in pandas 1.0

# Step 6: Final Considerations

The generated SVC model can also be persisted using scikit-learn's standard model persistence approach (see the official [docs](http://scikit-learn.org/stable/modules/model_persistence.html)) or exported to the PMML format.
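
For the first option, a minimal sketch using Python's built-in `pickle` module is shown here. It trains a small stand-in `SVC` on synthetic data (`demo_clf`, `X_demo` and `y_demo` are assumptions, not the model trained above) and checks that the restored model predicts the same labels.

```python
import pickle

import numpy as np
from sklearn.svm import SVC

# Hypothetical stand-in for the trained classifier.
rng = np.random.RandomState(0)
X_demo = rng.rand(60, 10)
y_demo = (X_demo[:, -1] > 0.5).astype(int)
demo_clf = SVC(kernel='rbf', C=10.0, gamma=0.1).fit(X_demo, y_demo)

# Serialize to bytes and restore; writing `blob` to a file works the same way.
blob = pickle.dumps(demo_clf)
restored = pickle.loads(blob)

same = np.array_equal(demo_clf.predict(X_demo), restored.predict(X_demo))
print(same)  # True
```

A pickled model should only be loaded from a trusted source and, ideally, with the same scikit-learn version that produced it.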

To persist in the PMML format, the [sklearn2pmml](https://github.com/jpmml/sklearn2pmml) library must be installed.
The following additional lines of code can be executed to generate the PMML file.

```python
from sklearn2pmml import PMMLPipeline
from sklearn2pmml import sklearn2pmml

gender_clf = PMMLPipeline([
	("classifier", clf)
])

sklearn2pmml(gender_clf, "GenderGuess.pmml", with_repr = True)
```

The PMML model can be used from other programming languages besides Python.
This [project](https://github.com/pintowar/gguess) shows the PMML file generated in this notebook being used in a Java/Groovy application.