# Data 620 Project 3
Matthew Tillmawitz

## Introduction

For this assignment we build a classifier from the names corpus of the NLTK library to predict gender from first names. Given the option of decision trees, naive Bayes, or maximum entropy models, we chose a naive Bayes classifier. Feature construction is based on the paper ["The sound of gender – correlations of name phonology and gender across languages"](https://www.degruyterbrill.com/document/doi/10.1515/ling-2020-0027/html?lang=en), which provides a detailed analysis of the characteristics of male and female names in Germanic and other languages. As English is a Germanic language and the names in the corpus are English, we design our features strictly around the characteristics of the Germanic names. As a result, many of the features are categorical and likely sparse, creating a high-dimensional feature space that suits naive Bayes models well. The naive assumption of feature independence does not strictly hold, since certain letters tend to occur together ("sh", "ph", "qu", etc.), but we expect the correlation to be too weak in this corpus to cause problems.

In [1]:
import nltk
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from nltk.corpus import names

In [2]:
nltk.download('names', quiet=True)

True

The target classes are imbalanced at roughly 3:5 (male to female, about 37% to 63%), which, while not insignificant, is not large enough to require correction. We therefore leave the data as is but preserve the class proportions when splitting the data.

In [3]:
male_names = [word.lower() for word in names.words('male.txt')]
female_names = [word.lower() for word in names.words('female.txt')]

names_df = pd.DataFrame({
    'name': male_names + female_names,
    'gender': ['m'] * len(male_names) + ['f'] * len(female_names)
    })

print(f"There are {len(male_names)} male names ({len(male_names) / len(names_df) * 100:.2f}% of names)")
print(f"There are {len(female_names)} female names ({len(female_names) / len(names_df) * 100:.2f}% of names)")

There are 2943 male names (37.05% of names)
There are 5001 female names (62.95% of names)
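
As a point of reference for the accuracies reported later, the counts above imply a majority-class baseline: a classifier that always answers "female" would be right for about 63% of the corpus. The short sketch below only reproduces that arithmetic from the printed counts; the variable names are ours and are not part of the notebook.

```python
# Counts taken from the output printed above.
male_count = 2943
female_count = 5001

# Accuracy of a classifier that always predicts the larger class.
majority_baseline = max(male_count, female_count) / (male_count + female_count)
print(f"Majority-class baseline accuracy: {majority_baseline:.4f}")
```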


It is important to know the length of the shortest names for the feature design portion, as logic will need to be added to avoid issues with names that are particularly short.

In [4]:
print(f"The shortest name in the corpus has length: {min(names_df['name'].str.len())}")

The shortest name in the corpus has length: 2


We split the data into training, dev-test, and test samples, with the dev-test and test samples consisting of 500 names each and the rest of the data used for training, as required by the assignment.

In [5]:
name_train, name_test_initial, gender_train, gender_test_initial = train_test_split(
    names_df['name'],
    names_df['gender'],
    test_size=1000,
    random_state=8675309,
    stratify=names_df['gender']
)

name_test, name_dev_test, gender_test, gender_dev_test = train_test_split(
    name_test_initial,
    gender_test_initial,
    test_size=0.5,
    random_state=8675309
)

# Prevent indexing issues when zipping later
name_train = name_train.tolist()
gender_train = gender_train.tolist()
name_dev_test = name_dev_test.tolist()
gender_dev_test = gender_dev_test.tolist()
name_test = name_test.tolist()
gender_test = gender_test.tolist()
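
Only the first call to `train_test_split` passes `stratify`, so the dev-test and test halves may drift slightly from the corpus-wide class proportions. One way to check is to compare the class shares of each split. The helper below is a minimal sketch: `class_shares` is a hypothetical name, and the label list is toy data rather than the notebook's actual split.

```python
from collections import Counter

def class_shares(labels):
    """Return the fraction of items carrying each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Toy labels for illustration only (not the real split).
toy_labels = ['f'] * 63 + ['m'] * 37
print(class_shares(toy_labels))
```

Applied to `gender_dev_test` and `gender_test` from the cell above, this would show how close each 500-name sample stays to the roughly 37/63 ratio.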

## Initial Features

For the initial feature set we focus on one of the easier characteristics to detect programmatically: the beginning and ending sounds of the names. The paper notes that ending vowel sounds and more front vowels indicate female names, while final consonants and fewer front vowels indicate male names. To detect these factors, we consider the first and last three letters of each name, as they will most likely capture the first and last syllables or phonemes. To encapsulate the full "sound" of these parts, we also look at the first and last letter as well as the first and last two letters. This generates a large number of features which, given the range of potential values, will be relatively sparsely populated, which suits our naive Bayes model.

In [6]:
# NLTK expects a list of (feature_dict, label) tuples; a function is simpler than working through a DataFrame
def features_base(name):
    features = {}

    features['last_letter'] = name[-1]
    features['last_two'] = name[-2:]
    features['last_three'] = name[-3:] if len(name) > 2 else name

    features['first_letter'] = name[0]
    features['first_two'] = name[:2]
    features['first_three'] = name[:3] if len(name) > 2 else name

    return features
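
To make the feature dictionary concrete, the sketch below runs the same function on two example names. The function body is repeated from the cell above only so that the block runs on its own; the example names are our own choice.

```python
def features_base(name):
    # Same logic as the notebook cell above, repeated for a standalone example.
    features = {}
    features['last_letter'] = name[-1]
    features['last_two'] = name[-2:]
    features['last_three'] = name[-3:] if len(name) > 2 else name
    features['first_letter'] = name[0]
    features['first_two'] = name[:2]
    features['first_three'] = name[:3] if len(name) > 2 else name
    return features

print(features_base("robert"))
# {'last_letter': 't', 'last_two': 'rt', 'last_three': 'ert',
#  'first_letter': 'r', 'first_two': 'ro', 'first_three': 'rob'}

# A two-letter name falls back to the whole name for the three-letter features.
print(features_base("ed"))
# {'last_letter': 'd', 'last_two': 'ed', 'last_three': 'ed',
#  'first_letter': 'e', 'first_two': 'ed', 'first_three': 'ed'}
```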

On the dev-test set our features perform well, with an accuracy of 0.856 compared with the roughly 0.63 that always guessing the majority class would achieve across the corpus. Looking at the most informative features, we see a typical decay pattern in feature value, with the last two letters appearing to be one of the best indicators of the gender of the name. This is somewhat surprising, as one might expect the last three letters to give a better indication of the ending phoneme, but that does not appear to be the case. Regardless, model performance is strong. Nevertheless, we will attempt to enhance the model by including the additional features which the literature indicates may be helpful.

In [7]:
base_train_set = [(features_base(name), gender) for name, gender in zip(name_train, gender_train)]
base_classifier = nltk.NaiveBayesClassifier.train(base_train_set)

base_dev_test_set = [(features_base(name), gender) for name, gender in zip(name_dev_test, gender_dev_test)]
base_accuracy = nltk.classify.accuracy(base_classifier, base_dev_test_set)
print(f"Accuracy: {base_accuracy}")

Accuracy: 0.856


In [8]:
base_classifier.show_most_informative_features(10)

Most Informative Features
                last_two = 'na'                f : m      =     94.1 : 1.0
                last_two = 'la'                f : m      =     72.3 : 1.0
                last_two = 'rt'                m : f      =     51.6 : 1.0
                last_two = 'us'                m : f      =     38.3 : 1.0
                last_two = 'ia'                f : m      =     37.7 : 1.0
                last_two = 'sa'                f : m      =     33.4 : 1.0
             last_letter = 'a'                 f : m      =     32.4 : 1.0
                last_two = 'rd'                m : f      =     30.7 : 1.0
             last_letter = 'k'                 m : f      =     29.3 : 1.0
              last_three = 'ert'               m : f      =     28.5 : 1.0
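
Each row above is a ratio of smoothed conditional probabilities: for `last_two = 'na'`, the estimated probability of that ending among female names is about 94 times its probability among male names. The sketch below shows how such a ratio could arise, assuming add-0.5 (expected likelihood) smoothing, which to our understanding is the default estimator used by `nltk.NaiveBayesClassifier.train`. The counts and the number of bins are invented for illustration and are not measured from the corpus.

```python
def smoothed_prob(count, total, bins, gamma=0.5):
    """Lidstone-style estimate: (count + gamma) / (total + gamma * bins)."""
    return (count + gamma) / (total + gamma * bins)

# Hypothetical counts, NOT taken from the names corpus.
female_total, male_total = 4500, 2650
female_with_na, male_with_na = 150, 1
distinct_endings = 300  # assumed number of distinct 'last_two' values

p_na_given_f = smoothed_prob(female_with_na, female_total, distinct_endings)
p_na_given_m = smoothed_prob(male_with_na, male_total, distinct_endings)

ratio = p_na_given_f / p_na_given_m
print(f"f : m = {ratio:.1f} : 1.0")
```

Because the smoothing adds a small constant to every count, an ending that is almost never seen for one class still gets a nonzero probability, which keeps the ratio finite.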


## Additional Features

We use the number of letters as a simple proxy for the number of syllables in a name, as the two are likely interchangeable from a modeling perspective. The vowel ratio, the share of letters in the name that are vowels, is similarly simple to include. The quality of the stressed vowel is difficult to determine solely from the names without a phonetic transcription or other indicator, so we will not consider it for this project.

In [9]:
def additional_features(name):
    features = features_base(name)

    features['length'] = len(name)

    vowels = 'aeiou'
    vowel_count = sum(1 for letter in name if letter in vowels)

    features['vowel_ratio'] = vowel_count / len(name)
    
    return features
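
One thing to keep in mind with these two features is that NLTK's naive Bayes classifier treats every distinct feature value as its own category, so `vowel_ratio` values such as 0.5 and 0.333 are unrelated labels to the model rather than points on a scale. The sketch below computes the ratio for a few names and shows one possible coarsening. `vowel_ratio` and `vowel_ratio_bucket` are hypothetical helpers that the notebook does not use, and the number of buckets is an arbitrary assumption.

```python
VOWELS = "aeiou"  # same vowel set as the cell above; 'y' is not counted

def vowel_count(name):
    return sum(1 for letter in name if letter in VOWELS)

def vowel_ratio(name):
    return vowel_count(name) / len(name)

def vowel_ratio_bucket(name, buckets=5):
    # Integer arithmetic avoids floating-point edge cases at bucket boundaries.
    index = vowel_count(name) * buckets // len(name)
    return min(index, buckets - 1)

for name in ["anna", "robert", "lydia"]:
    print(name, round(vowel_ratio(name), 3), vowel_ratio_bucket(name))
```

Whether bucketing helps is an empirical question that would need to be checked on the dev-test set; it is shown here only to illustrate how the raw ratio fragments into many distinct values.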

The additional features result in slightly worse dev-test performance, with an accuracy of 0.844. While this is not a large loss, the additional features do not appear to improve the model's ability to label the names, and they do not alter the top ten most informative features.

In [10]:
additional_train_set = [(additional_features(name), gender) for name, gender in zip(name_train, gender_train)]
additional_classifier = nltk.NaiveBayesClassifier.train(additional_train_set)

additional_dev_test_set = [(additional_features(name), gender) for name, gender in zip(name_dev_test, gender_dev_test)]
additional_accuracy = nltk.classify.accuracy(additional_classifier, additional_dev_test_set)
print(f"Accuracy: {additional_accuracy}")

Accuracy: 0.844


In [11]:
additional_classifier.show_most_informative_features(10)

Most Informative Features
                last_two = 'na'                f : m      =     94.1 : 1.0
                last_two = 'la'                f : m      =     72.3 : 1.0
                last_two = 'rt'                m : f      =     51.6 : 1.0
                last_two = 'us'                m : f      =     38.3 : 1.0
                last_two = 'ia'                f : m      =     37.7 : 1.0
                last_two = 'sa'                f : m      =     33.4 : 1.0
             last_letter = 'a'                 f : m      =     32.4 : 1.0
                last_two = 'rd'                m : f      =     30.7 : 1.0
             last_letter = 'k'                 m : f      =     29.3 : 1.0
              last_three = 'ert'               m : f      =     28.5 : 1.0


## Evaluation

Comparing the performance of both models on the test set gives a rather interesting result. While the base features model sees a decrease in accuracy, as is expected, the model with additional features performs slightly better on the test set than on the dev-test set (0.848 versus 0.844) and outperforms the base features model (0.848 versus 0.832). This suggests that the additional features improve the generalizability of the model and rein in some of the overfitting caused by the base features. While this result might be counterintuitive given the model performance on the dev-test data, it is in line with the conclusions of the research paper used as a reference for feature design. The model might benefit from some pruning of the base features to further address overfitting. However, as performance with the additional features is already very good and name spelling can vary widely, we will leave the model as is.

In [12]:
base_test_set = [(features_base(name), gender) for name, gender in zip(name_test, gender_test)]
additional_test_set = [(additional_features(name), gender) for name, gender in zip(name_test, gender_test)]

base_test_accuracy = nltk.classify.accuracy(base_classifier, base_test_set)
# Use a separate name so the dev-test accuracy from In [10] is not overwritten
additional_test_accuracy = nltk.classify.accuracy(additional_classifier, additional_test_set)

print(f"Base feature model accuracy: {base_test_accuracy}")
print(f"Additional feature model accuracy: {additional_test_accuracy}")

Base feature model accuracy: 0.832
Additional feature model accuracy: 0.848
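
With 500 test names, each accuracy estimate carries some sampling noise. A rough way to gauge it is the binomial standard error, sqrt(p(1 - p) / n). The sketch below applies that formula to the two accuracies printed above. It treats each name as an independent trial and ignores that both models are scored on the same names, so it is only an approximate guide; `accuracy_standard_error` is a hypothetical helper.

```python
import math

def accuracy_standard_error(accuracy, n):
    """Binomial standard error of an accuracy measured on n examples."""
    return math.sqrt(accuracy * (1 - accuracy) / n)

n_test = 500  # size of the test sample
for label, acc in [("base", 0.832), ("additional", 0.848)]:
    se = accuracy_standard_error(acc, n_test)
    print(f"{label}: {acc:.3f} +/- {se:.3f}")
```

Both standard errors come out between 0.016 and 0.017, about the same size as the 0.016 gap between the models (8 names out of 500), so the ranking of the two models on this test set should be read with some caution.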


### Bibliography

Ackermann, Tanja and Zimmer, Christian. "The sound of gender – correlations of name phonology and gender across languages." Linguistics, vol. 59, no. 4, 2021, pp. 1143-1177. https://doi.org/10.1515/ling-2020-0027