# Gender Me

## Context

As part of my Master's degree in Information Science, my main research project deals with predicting a person's gender from their name and, potentially, their nationality. A colleague and I aim to benchmark the most prominent tools and services on the market to see which performs best and under which circumstances, and to examine whether they show notable bias one way or the other. Similar studies have already been done, notably by Santamaría and Mihaljević, so we have focused this research on the geographical origin of names and how the tools handle it.

At the same time, I'm following the course on Machine Learning and Advanced Neural Net, which lets me discover the inner workings of various models and how the algorithms that power them work on a mathematical level. While the course is fascinating and one I'm always happy to attend, it lacks any direct practical application: we study the theory and the maths behind those models, not how to use this knowledge in real life.

Since my goal is to work in Data Science professionally, I am combining these two elements here by building my own small-scale model for inferring gender from names. As part of my research project, I already have access to a decently curated dataset of people that records each person's name and gender, and I believe it is enough to build something interesting.

In [1]:
import numpy as np 
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler, FunctionTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.compose import ColumnTransformer
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

## Preparing the dataset

In [2]:
df = pd.read_csv("../data/all.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,fullName,firstName,lastName,gender,isoCountry,continent,birthYear,hasMiddleName,hasNoLastName,source
0,0,Clara Benson,Clara,Benson,female,GH,Africa,2000,False,False,wikidata
1,1,Esther Ruth Mbabazi,,,female,UG,Africa,1995,True,False,wikidata
2,2,Dalia Ziada,Dalia,Ziada,female,EG,Africa,1982,False,False,wikidata
3,3,Fatou Haidara,Fatou,Haidara,female,ML,Africa,1962,False,False,wikidata
4,4,Claude Haffner,Claude,Haffner,female,,,1976,False,False,wikidata


In [3]:
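# Keep only the name columns and the label; copy so the new column is added to an independent frame
df = df[["fullName", "firstName", "lastName", "gender"]].copy()
# Encode the label as an integer target: female -> 1, male -> 0
genderMap = {
    'female': 1,
    'male': 0
}
df["genderTarget"] = df.gender.map(genderMap)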
df.head()

Unnamed: 0,fullName,firstName,lastName,gender,genderTarget
0,Clara Benson,Clara,Benson,female,1
1,Esther Ruth Mbabazi,,,female,1
2,Dalia Ziada,Dalia,Ziada,female,1
3,Fatou Haidara,Fatou,Haidara,female,1
4,Claude Haffner,Claude,Haffner,female,1


In [4]:
df["genderTarget"].value_counts()

genderTarget
0    8069
1    7918
Name: count, dtype: int64

In [5]:
df = df.dropna(subset=["firstName"], axis=0)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 13019 entries, 0 to 15986
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   fullName      13019 non-null  object
 1   firstName     13019 non-null  object
 2   lastName      13019 non-null  object
 3   gender        13019 non-null  object
 4   genderTarget  13019 non-null  int64 
dtypes: int64(1), object(4)
memory usage: 610.3+ KB


In [6]:
df.to_csv("../data/all_preprocessed.csv")

## Setting up

In [7]:
X_train, X_test, y_train, y_test = train_test_split(
    df[['firstName']], 
    df['genderTarget'], 
    test_size=0.2, 
    random_state=42)

In [8]:
X_train.head()

Unnamed: 0,firstName
613,Kokoro
12889,Maddison
14600,Geiner
12171,Rose
6863,Brahima


In [9]:
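# Step 0: feature engineering, included in the pipeline

def getLastLetter(df):
    # Add a 'lastLetter' column holding the final character of each first name.
    # Note: this writes to the DataFrame it receives, so the caller's frame gains the column too.
    df['lastLetter'] = [name[-1] for name in df['firstName']]
    return df

class FeatureEngineering(BaseEstimator, TransformerMixin):
    """Apply a list of (name, transformer) pairs in order; nothing is learned in fit."""

    def __init__(self, transformers):
        self.transformers = transformers

    def fit(self, X, y=None):
        # Stateless: the wrapped transformers only derive columns from the input
        return self

    def transform(self, X):
        for name, transformer in self.transformers:
            X = transformer.transform(X)
        return X

featureEngineeringTransformer = [
    ('getLastLetter', FunctionTransformer(getLastLetter))
]

# Instantiate the custom transformer
feature_engineering_step = FeatureEngineering(featureEngineeringTransformer)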
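
The last-letter step above adds its column to the frame it is given. Below is a minimal sketch of a variant that leaves its input untouched, assuming the same `firstName` column. The name `add_last_letter` is hypothetical and not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def add_last_letter(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` with a `lastLetter` column; the input is not modified."""
    return frame.assign(lastLetter=frame["firstName"].str[-1])


# FunctionTransformer is stateless here, so transform can be called directly
last_letter_step = FunctionTransformer(add_last_letter)

names = pd.DataFrame({"firstName": ["Clara", "Dalia", "Claude"]})
with_letter = last_letter_step.transform(names)

print(with_letter["lastLetter"].tolist())  # ['a', 'a', 'e']
print(list(names.columns))                 # ['firstName'] (original frame unchanged)
```

With a variant like this, calling the pipeline on `X_test` would not add a `lastLetter` column to `X_test` itself.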

In [10]:
# Step 1 : Vectorizing the names

name_vectorizer = Pipeline(steps=[
    ('namevectorizer', CountVectorizer(analyzer='char', ngram_range=(1,2), max_features=500))
])

# Step 2 : OHE the last letter
cat_transformer = Pipeline(steps=[
    ('ohe', OneHotEncoder(handle_unknown='ignore', drop='first'))
])
# combine this together
preprocessor = ColumnTransformer([
    ('vec', name_vectorizer, 'firstName'),
    ('cat', cat_transformer, ['lastLetter']),
    # ('scaler', StandardScaler())
])

preprocessingPipeline = Pipeline(steps=[
    ('featureEngineering', feature_engineering_step),
    ('preprocessor', preprocessor), 
    # ('clf', GradientBoostingClassifier())
])

In [11]:
X_train_preprocessed = preprocessingPipeline.fit_transform(X_train)

In [12]:
def get_feature_names(preprocessor:ColumnTransformer):
    feature_names = []

    vec:CountVectorizer = preprocessor.named_transformers_['vec'].named_steps['namevectorizer']
    vec_feature_name = vec.get_feature_names_out(['firstName'])
    feature_names.extend(vec_feature_name)

    ohe:OneHotEncoder = preprocessor.named_transformers_['cat'].named_steps['ohe']
    ohe_features_name = ohe.get_feature_names_out(['lastLetter'])
    feature_names.extend(ohe_features_name)
    
    return feature_names

feature_names = get_feature_names(preprocessor)

X_train_preprocessed_dense = X_train_preprocessed.toarray()
X_train_preprocessed_named = pd.DataFrame(X_train_preprocessed_dense, columns=feature_names)

## Testing a model

In [13]:
import optuna
import warnings
from sklearn.metrics import mean_squared_error, make_scorer



## Hyperparameter tuning

In [16]:
df = pd.read_csv("../data/all_preprocessed.csv")

X_train, X_test, y_train, y_test = train_test_split(
    df[['firstName']], 
    df['genderTarget'], 
    test_size=0.2, 
    random_state=42)

X_train_preprocessed = preprocessingPipeline.fit_transform(X_train)
X_train_preprocessed_dense = X_train_preprocessed.toarray()
X_train_preprocessed_named = pd.DataFrame(X_train_preprocessed_dense, columns=feature_names)

X_test_pre = preprocessingPipeline.transform(X_test)
X_test_preprocessed_dense = X_test_pre.toarray()
X_test_preprocessed_named = pd.DataFrame(X_test_preprocessed_dense, columns=feature_names)




In [17]:
print(X_train_preprocessed_named.shape)
print(y_train.shape)

print(X_test_preprocessed_named.shape)
print(y_test.shape)

(10415, 552)
(10415,)
(2604, 552)
(2604,)


In [18]:
import optuna
import warnings

# Silence library warnings to keep the trial log readable
warnings.filterwarnings("ignore")


def objective(trial):
    param = {
        # Number of boosting stages
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),

        # Maximum depth of individual estimators (trees)
        "max_depth": trial.suggest_int("max_depth", 2, 10),

        # Minimum number of samples required to split a node
        "min_samples_split": trial.suggest_int("min_samples_split", 2, 20),

        # Minimum number of samples required at a leaf node
        "min_samples_leaf": trial.suggest_int("min_samples_leaf", 1, 20),

        # Number of features considered when looking for the best split (None uses all of them)
        "max_features": trial.suggest_categorical("max_features", ["sqrt", "log2", None]),

        # Learning rate shrinks contribution of each tree
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),

        # Subsample ratio of the training set for each base learner
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),

        # Loss function (binary classification → “log_loss” is the most common)
        "loss": trial.suggest_categorical("loss", ["log_loss", "exponential"])
    }

    # Create and train the model
    model = GradientBoostingClassifier(
        **param,
        random_state=42
    )
    
    # Use cross_val_score for cross-validation
    scores = cross_val_score(model, X_train_preprocessed_named, y_train, scoring="accuracy", cv=2)
    
    # Return the mean cross-validated accuracy, which the study maximizes
    return np.mean(scores)

# Create a study and optimize the objective function
study_gradientBoostClassifier = optuna.create_study(direction='maximize')
study_gradientBoostClassifier.optimize(objective, n_trials=100) 

# Print best parameters
print('Best trial:')
trial = study_gradientBoostClassifier.best_trial

print('  Value: {}'.format(trial.value))
print('  Params: ')
for key, value in trial.params.items():
    print('    {}: {}'.format(key, value))

[I 2025-09-24 14:12:46,422] A new study created in memory with name: no-name-c808042b-dfcf-4018-bdc8-068135bad6dc
[I 2025-09-24 14:12:52,520] Trial 0 finished with value: 0.8433033510956685 and parameters: {'n_estimators': 575, 'max_depth': 6, 'min_samples_split': 14, 'min_samples_leaf': 6, 'max_features': 'sqrt', 'learning_rate': 0.07459119559518712, 'subsample': 0.9557564570853425, 'loss': 'log_loss'}. Best is trial 0 with value: 0.8433033510956685.
[I 2025-09-24 14:13:56,534] Trial 1 finished with value: 0.8550167460381379 and parameters: {'n_estimators': 449, 'max_depth': 9, 'min_samples_split': 11, 'min_samples_leaf': 14, 'max_features': None, 'learning_rate': 0.09484769243613285, 'subsample': 0.6860132852119021, 'loss': 'log_loss'}. Best is trial 1 with value: 0.8550167460381379.
[I 2025-09-24 14:15:33,637] Trial 2 finished with value: 0.8427268164060138 and parameters: {'n_estimators': 956, 'max_depth': 5, 'min_samples_split': 13, 'min_samples_leaf': 13, 'max_features': None, 'l

Best trial:
  Value: 0.8648105343539375
  Params: 
    n_estimators: 535
    max_depth: 9
    min_samples_split: 9
    min_samples_leaf: 7
    max_features: None
    learning_rate: 0.1541965242168827
    subsample: 0.782094001281676
    loss: log_loss
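
The objective above scores features that were vectorized once on the whole of `X_train`, so every fold shares the same n-gram vocabulary. One possible alternative is to cross-validate the complete pipeline on raw names, so that the vectorizer is refit inside each fold. The sketch below assumes a toy name list and swaps in `LogisticRegression` to keep it fast; neither comes from the notebook.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 1 = female, 0 = male (same encoding as genderMap)
toy_names = pd.Series([
    "Anna", "Maria", "Elena", "Sofia", "Clara", "Dalia", "Laura", "Julia",
    "John", "Leon", "Mark", "Peter", "Omar", "Ivan", "Carl", "Hugo",
])
toy_target = pd.Series([1] * 8 + [0] * 8)

# Vectorizer and classifier live in one estimator, so each fold fits both
toy_pipeline = Pipeline(steps=[
    ("vec", CountVectorizer(analyzer="char", ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])

# For classifiers, an integer cv uses stratified folds
fold_scores = cross_val_score(toy_pipeline, toy_names, toy_target, scoring="accuracy", cv=2)
print(fold_scores.mean())
```

Inside an Optuna objective, the same call could be returned as the trial value, at the cost of refitting the preprocessing for every fold.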


## Best model

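Thanks to Optuna, we now have the hyperparameters of the best trial, which reached a mean cross-validated accuracy of about 0.865. We can use them to fit the model on the full training set and evaluate it on the held-out test data.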

In [27]:
study_gradientBoostClassifier.best_params

{'n_estimators': 535,
 'max_depth': 9,
 'min_samples_split': 9,
 'min_samples_leaf': 7,
 'max_features': None,
 'learning_rate': 0.1541965242168827,
 'subsample': 0.782094001281676,
 'loss': 'log_loss'}

In [20]:
best_params = {'n_estimators': 535,
 'max_depth': 9,
 'min_samples_split': 9,
 'min_samples_leaf': 7,
 'max_features': None,
 'learning_rate': 0.1541965242168827,
 'subsample': 0.782094001281676,
 'loss': 'log_loss'}


model = GradientBoostingClassifier(
        **best_params,
        # **study_gradientBoostClassifier.best_params,
        random_state=42
    )

modelWithPipeline = Pipeline(steps=[
    ('featureEngineering', feature_engineering_step),
    ('preprocessor', preprocessor), 
    ('clf', model)
])


modelWithPipeline.fit(X_train, y_train)

[Output: scikit-learn's HTML diagram of the fitted Pipeline, with the steps featureEngineering (FeatureEngineering), preprocessor (ColumnTransformer with the 'vec' CountVectorizer and the 'cat' OneHotEncoder, remainder='drop') and clf (GradientBoostingClassifier with the best_params listed above).]

In [21]:
y_pred = modelWithPipeline.predict(X_test)



In [22]:
from sklearn.metrics import accuracy_score

# Rows are true classes, columns are predicted classes (0 = male, 1 = female)
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# 'weighted' averages the per-class scores, weighted by each class's support
precision = precision_score(y_test, y_pred, average='weighted')
print("Precision:", precision)

recall = recall_score(y_test, y_pred, average='weighted')
print("Recall:", recall)

f1 = f1_score(y_test, y_pred, average='weighted')
print("F1-Score:", f1)

Confusion Matrix:
 [[1191  155]
 [ 194 1064]]
Accuracy: 0.8659754224270353
Precision: 0.8661689067249279
Recall: 0.8659754224270353
F1-Score: 0.8658772916882859
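
As a sanity check, the scores above can be recomputed by hand from the confusion matrix. The first score printed is plain accuracy, and the weighted recall equals it by construction. The sketch below copies the matrix from the output. The majority-class baseline is an extra reference point, not something the notebook computes.

```python
import numpy as np

# Confusion matrix copied from the output above: rows = true class, columns = predicted class
cm = np.array([[1191, 155],
               [194, 1064]])

support = cm.sum(axis=1)                            # true samples per class
accuracy = np.trace(cm) / cm.sum()                  # correct predictions / all predictions
precision_per_class = np.diag(cm) / cm.sum(axis=0)  # per predicted class
recall_per_class = np.diag(cm) / cm.sum(axis=1)     # per true class

# 'weighted' averaging: per-class scores weighted by class support
weighted_precision = np.average(precision_per_class, weights=support)
weighted_recall = np.average(recall_per_class, weights=support)

# Accuracy of always predicting the most frequent class in the test set
majority_baseline = support.max() / cm.sum()

print("Accuracy:", accuracy)
print("Weighted precision:", weighted_precision)
print("Weighted recall:", weighted_recall)
print("Majority baseline:", majority_baseline)
```

The per-class arrays also show which class the model handles better, which the weighted averages hide.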


In [23]:
X_test.head()

Unnamed: 0,firstName,lastLetter
811,Elisabeth,h
5945,Rasuljon,n
353,Elsa,a
3409,Tesfaye,e
4900,ALI,I


In [24]:
testDF = pd.DataFrame(['Nemo', 'Leon', 'John'], columns=["firstName"])
print(modelWithPipeline.predict(testDF))
print(modelWithPipeline.predict_proba(testDF))

[0 0 0]
[[0.74984261 0.25015739]
 [0.96068806 0.03931194]
 [0.99008977 0.00991023]]
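
The columns of `predict_proba` follow the classifier's classes in sorted order, here 0 (male) then 1 (female). Below is a short sketch of turning the probabilities printed above back into readable labels. `target_to_gender` is a hypothetical helper that simply inverts `genderMap`.

```python
import numpy as np

# Probabilities copied from the output above; column 0 = male, column 1 = female
proba = np.array([
    [0.74984261, 0.25015739],
    [0.96068806, 0.03931194],
    [0.99008977, 0.00991023],
])
target_to_gender = {0: "male", 1: "female"}  # inverse of genderMap

predicted_target = proba.argmax(axis=1)  # most probable class per name
confidence = proba.max(axis=1)           # probability of that class

for name, target, conf in zip(["Nemo", "Leon", "John"], predicted_target, confidence):
    print(f"{name}: {target_to_gender[int(target)]} ({conf:.2f})")
# Nemo: male (0.75)
# Leon: male (0.96)
# John: male (0.99)
```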


In [25]:
import joblib

joblib.dump(modelWithPipeline, 'modelWithPipeline.joblib')

['modelWithPipeline.joblib']
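
A saved pipeline can be restored with `joblib.load`. The sketch below uses a small stand-in pipeline and a temporary directory so that it is self-contained; the toy names and the `LogisticRegression` classifier are assumptions, not the notebook's model.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical stand-in for modelWithPipeline
toy_names = ["Anna", "Maria", "Elena", "Clara", "John", "Leon", "Mark", "Hugo"]
toy_target = [1, 1, 1, 1, 0, 0, 0, 0]

toy_pipeline = Pipeline(steps=[
    ("vec", CountVectorizer(analyzer="char", ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])
toy_pipeline.fit(toy_names, toy_target)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "toy_pipeline.joblib"
    joblib.dump(toy_pipeline, model_path)
    restored = joblib.load(model_path)

# The restored pipeline should predict exactly like the original
predictions_match = (restored.predict(toy_names) == toy_pipeline.predict(toy_names)).all()
print(predictions_match)
```

One caveat for the real pipeline: pickle-based formats store custom functions and classes by reference, so `getLastLetter` and `FeatureEngineering` need to be defined or importable in the session that loads `modelWithPipeline.joblib`.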

In [27]:
type(modelWithPipeline)

sklearn.pipeline.Pipeline