# PROJECT 3

Molly Siebecker and Marley Myrianthopoulos

CUNY SPS

DATA 620

Summer 2024

## INTRODUCTION

For this project, we created a machine-learning algorithm to predict whether a person with a given name is male or female. We used nltk's names corpus as our dataset. Five hundred names are sequestered for the test set, and the remainder are used for the training process.

In [1]:
# Import required libraries and data
import nltk
import numpy as np
import pandas as pd
from nltk.corpus import names
import random

random.seed(1989)

names_list = [(name, "male") for name in names.words("male.txt")] + [
    (name, "female") for name in names.words("female.txt")
]
random.shuffle(names_list)

# Create test-train split
test_names = names_list[:500]
trainset_names = names_list[500:]

print("Size of Test Set:", len(test_names))
print("Size of Training Set:", len(trainset_names))

Size of Test Set: 500
Size of Training Set: 7444


## TRAINING PROCESS

To efficiently test our various feature extractors, we wrapped the code to test our model in the function `names_model_test`. This function shuffles the training set, and then sequesters 500 names for the dev-test set (to ensure that the dev-test set is different each time). It then analyzes the data using a maximum entropy classifier with 20 max iterations.$^1$ The function then reports the average performance of the model on the dev-test set over 20 attempts and provides a list of names erroneously labeled as female names and a list of names erroneously labeled as male names by the model in the final attempt to aid in error analysis and the model improvement process.

$^1$ When considering which of the three classifiers to use, we quickly decided against the decision tree classifier, and were choosing between the naive Bayes and maximum entropy classifiers. We weighed this decision by considering that the maximum entropy is a conditional classifier, it cannot answer questions about the likelihood of an input value, only about the likelihood of a label for a given input value. The Naive Bayes Classifier, as a generative classifier, can answer both of those types of questions, however, the maximum entropy classifier is better at determining the likelihood of a label for a given input value. Since our goal is to accurately predict labels for given input values, and we are not interested in the likelihoods of the input values themselves, we decided to go with the maximum entropy classifier.

In [2]:
def names_model_test(gender_features_func):
    results = []
    for i in range(20):
        # Create devtest and training set split
        random.shuffle(trainset_names)
        devtest_names = trainset_names[:500]
        train_names = trainset_names[500:]
        devtest_featuresets = [(gender_features_func(n), g) for (n, g) in devtest_names]
        train_featuresets = [(gender_features_func(n), g) for (n, g) in train_names]

        # Train the classifier on the training set and print the accuracy
        global model
        model = nltk.MaxentClassifier.train(train_featuresets, max_iter=20)
        results.append(nltk.classify.accuracy(model, devtest_featuresets))

    print("Results:", results)
    print("Average Model Accuracy:", np.mean(results))
    print("Median Model Accuracy:", np.median(results))

    # Create list of devtest errors from iteration
    errors_name = []
    errors_tag = []
    errors_guess = []

    for name, tag in devtest_names:
        guess = model.classify(gender_features_func(name))
        if guess != tag:
            errors_name.append(name)
            errors_tag.append(tag)
            errors_guess.append(guess)

    # Create DataFrame of errors
    errors_df = pd.DataFrame(
        {"name": errors_name, "tag": errors_tag, "guess": errors_guess}
    )

    # Print male errors
    print("Male Errors:", errors_df[errors_df.tag == "male"]["name"].tolist())

    # Print female errors
    print("Female Errors: ", errors_df[errors_df.tag == "female"]["name"].tolist())

The first iteration of our feature extractor function used the example from the textbook _Natural Language Processing with Python_ (Bird, Klein, and Loper, first edition, p. 223). This feature extractor uses only the last letter of the name.

In [3]:
# Define the features for the first classifier and run the model
def gender_features_iter_1(word):
    return {"last_letter": word[-1]}


names_model_test(gender_features_iter_1)

  ==> Training (20 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.371
             2          -0.37427        0.761
             3          -0.37386        0.761
             4          -0.37360        0.761
             5          -0.37344        0.761
             6          -0.37332        0.761
             7          -0.37323        0.761
             8          -0.37316        0.761
             9          -0.37310        0.761
            10          -0.37306        0.761
            11          -0.37302        0.761
            12          -0.37299        0.761
            13          -0.37296        0.761
            14          -0.37294        0.761
            15          -0.37291        0.761
            16          -0.37290        0.761
            17          -0.37288        0.761
            18          -0.37286        0.761
            19          -0.37285        0.761
  

This intial model has an average accuracy of 76.19% and a median accuracy of 76.2% over 20 attempts.

Our second iteration of the model added the additional features suggested by the textbook (first letter and length of name).

In [4]:
# Define the features for the second classifier and run the model


def gender_features_iter_2(word):
    return {
        "first_letter": word[0].lower(),
        "last_letter": word[-1],
        "length_of_name": len(word),
    }


names_model_test(gender_features_iter_2)

  ==> Training (20 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.369
             2          -0.47831        0.732
             3          -0.42257        0.773
             4          -0.39519        0.775
             5          -0.38006        0.777
             6          -0.37091        0.778
             7          -0.36504        0.778
             8          -0.36110        0.780
             9          -0.35837        0.781
            10          -0.35644        0.781
            11          -0.35503        0.781
            12          -0.35399        0.781
            13          -0.35320        0.780
            14          -0.35260        0.779
            15          -0.35214        0.779
            16          -0.35177        0.779
            17          -0.35148        0.779
            18          -0.35125        0.779
            19          -0.35106        0.779
  

These additional features increased the average accuracy of the model over 20 attempts to 78.31% (the median accuracy increased to 78.2%).

Many of the model's errors involved names with double letters, so we added a feature that checked if the name contains a double letter.

In [5]:
# Create a function that checks if a word contains double letters
def double_letter_check(word):
    for i in range(len(word) - 1):
        if word[i] == word[i + 1]:
            return True
    return False


# Define the features for the third classifier and run the model
def gender_features_iter_3(word):
    return {
        "first_letter": word[0].lower(),
        "last_letter": word[-1],
        "length_of_name": len(word),
        "doubles": double_letter_check(word),
    }


names_model_test(gender_features_iter_3)

  ==> Training (20 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.370
             2          -0.50637        0.696
             3          -0.44827        0.769
             4          -0.41505        0.778
             5          -0.39488        0.783
             6          -0.38185        0.785
             7          -0.37300        0.786
             8          -0.36675        0.786
             9          -0.36220        0.788
            10          -0.35882        0.787
            11          -0.35625        0.787
            12          -0.35427        0.787
            13          -0.35272        0.787
            14          -0.35150        0.786
            15          -0.35052        0.786
            16          -0.34973        0.786
            17          -0.34909        0.787
            18          -0.34856        0.787
            19          -0.34813        0.787
  

This new feature decreased the average accuracy of the model over 20 attempts to 77.82% and the median accuracy remained largely unchanged at 78.3%, so we removed it.

We then added features that checked if the ending of the name was a traditionally feminine sequence of letters ("ina", "ia", "a", or "lly") or masculine set of letters ("o", "on", "er", "tt").

In [6]:
# Define a function that checks if a word contains a feminized ending
def feminine_ending(word):
    if word[-3:] == "ina" or word[-2:] == "ia" or word[-1] == "a" or word[-3:] == "lly":
        return True
    return False


# Define a function that checks if a word contains a masculine ending
def masculine_ending(word):
    if (
        word[-1] == "o"
        or word[-2:] == "on"
        or word[-2:] == "er"
        or word[-2:] == "tt"
        or word[-2:] == "el"
    ):
        return True
    return False


# Define the features for the fourth classifier and run the model
def gender_features_iter_4(word):
    return {
        "first_letter": word[0].lower(),
        "last_letter": word[-1],
        "length_of_name": len(word),
        "feminine_ending": feminine_ending(word),
        "masculine_ending": masculine_ending(word),
    }


names_model_test(gender_features_iter_4)

  ==> Training (20 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.373
             2          -0.48612        0.712
             3          -0.43535        0.775
             4          -0.40848        0.787
             5          -0.39190        0.786
             6          -0.38071        0.787
             7          -0.37273        0.788
             8          -0.36683        0.789
             9          -0.36236        0.789
            10          -0.35889        0.789
            11          -0.35616        0.790
            12          -0.35398        0.790
            13          -0.35221        0.790
            14          -0.35076        0.790
            15          -0.34957        0.789
            16          -0.34857        0.789
            17          -0.34774        0.789
            18          -0.34704        0.789
            19          -0.34644        0.789
  

This addition further improved the average accuracy of the model over 20 attempts to 78.58% (the median accuracy improved to 78.8%).

Since looking at the last group of letters instead of just the last letter seemed fruitful, we added additional features accounting for the first 2, last 2, first 3, and last 3 letters of the name.

In [7]:
# Define the features for the fifth classifier and run the model
def gender_features_iter_5(word):
    return {
        "first_letter": word[0].lower(),
        "last_letter": word[-1],
        "length_of_name": len(word),
        "feminine_ending": feminine_ending(word),
        "masculine_ending": masculine_ending(word),
        "first_two": word[0:2].lower(),
        "last_two": word[-2:],
        "first_three": word[0:3].lower(),
        "last_three": word[-3:],
    }


names_model_test(gender_features_iter_5)

  ==> Training (20 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.372
             2          -0.42754        0.807
             3          -0.35459        0.854
             4          -0.31558        0.866
             5          -0.29072        0.873
             6          -0.27309        0.880
             7          -0.25968        0.886
             8          -0.24897        0.890
             9          -0.24013        0.894
            10          -0.23264        0.896
            11          -0.22617        0.899
            12          -0.22049        0.902
            13          -0.21544        0.904
            14          -0.21091        0.907
            15          -0.20682        0.909
            16          -0.20308        0.910
            17          -0.19966        0.911
            18          -0.19650        0.913
            19          -0.19358        0.913
  

These additions further improves the average accuracy of the model over 20 attempts to 81.99% (the median accuracy improved to 82%).

Several of the female names that were mis-classified as male names are real existing words (Sandy, Cat, Pearl, Sunny, Olive, Brook). To account for this, we included a feature that checks if the name is an existing English word.

In [8]:
# Create a function that checks if a word is an existing English word
from nltk.corpus import words


def real_word_name(word):
    if word.lower() in words.words():
        return True
    return False


# Create a list of all of the names that are existing English words
male_names = names.words("male.txt")
female_names = names.words("female.txt")
all_names = list(set(male_names + female_names))
names_df = pd.DataFrame({"name": all_names})
names_df["real_word"] = names_df.name.apply(real_word_name)
names_df = names_df[names_df.real_word == True]
word_names = list(names_df.name)
word_names_lower = {name.lower() for name in word_names}


# Checks the list of existing words to save time
def real_word_name_check(word):
    if word.lower() in word_names_lower:
        return True
    return False


# Define the features for the sixth classifier and run the model
def gender_features_iter_6(word):
    return {
        "first_letter": word[0].lower(),
        "last_letter": word[-1],
        "length_of_name": len(word),
        "feminine_ending": feminine_ending(word),
        "masculine_ending": masculine_ending(word),
        "first_two": word[0:2].lower(),
        "last_two": word[-2:],
        "first_three": word[0:3].lower(),
        "last_three": word[-3:],
        "real_word": real_word_name_check(word),
    }


names_model_test(gender_features_iter_6)

  ==> Training (20 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.371
             2          -0.44185        0.793
             3          -0.36711        0.852
             4          -0.32596        0.864
             5          -0.29961        0.874
             6          -0.28096        0.880
             7          -0.26683        0.884
             8          -0.25561        0.888
             9          -0.24638        0.892
            10          -0.23858        0.895
            11          -0.23186        0.898
            12          -0.22597        0.901
            13          -0.22075        0.903
            14          -0.21607        0.906
            15          -0.21184        0.908
            16          -0.20798        0.910
            17          -0.20445        0.911
            18          -0.20119        0.912
            19          -0.19817        0.912
  

These additions did not improve the model's accuracy, so we discarded them.

Since focusing on the specific letters at the beginning and end of names had been fruitful, we included a feature that checks if the name begins with an unvoiced consonant (a stereotypically feminine name trait).



In [9]:
# Define a function that checks if a word starts with an unvoiced consonant
def unvoiced_beginning(word):
    if (
        word[0] == "f"
        or word[0] == "k"
        or word[0] == "c"
        or word[0] == "p"
        or word[0] == "s"
        or word[0] == "t"
        or word[0:2] == "sh"
        or word[0:2] == "ch"
    ):
        return True
    return False


# Define the features for the seventh classifier and run the model
def gender_features_iter_7(word):
    return {
        "first_letter": word[0].lower(),
        "last_letter": word[-1],
        "length_of_name": len(word),
        "feminine_ending": feminine_ending(word),
        "masculine_ending": masculine_ending(word),
        "first_two": word[0:2].lower(),
        "last_two": word[-2:],
        "first_three": word[0:3].lower(),
        "last_three": word[-3:],
        "unvoiced": unvoiced_beginning(word),
    }


names_model_test(gender_features_iter_7)

  ==> Training (20 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.371
             2          -0.44295        0.791
             3          -0.36817        0.851
             4          -0.32688        0.864
             5          -0.30042        0.872
             6          -0.28169        0.878
             7          -0.26750        0.883
             8          -0.25623        0.887
             9          -0.24696        0.891
            10          -0.23914        0.893
            11          -0.23240        0.896
            12          -0.22651        0.898
            13          -0.22128        0.900
            14          -0.21660        0.902
            15          -0.21236        0.905
            16          -0.20851        0.907
            17          -0.20497        0.908
            18          -0.20171        0.910
            19          -0.19869        0.911
  

This addition also did not improve the model's accuracy, so we discarded it.

We weren't sure if the `masculine_ending` and `feminine_ending` features were redundant since we had added features for the last two and last three letters, so we tried removing them to see if it decreased the accuracy of the model.


In [10]:
# Define the features for the eighth classifier and run the model
def gender_features_iter_8(word):
    return {
        "first_letter": word[0].lower(),
        "last_letter": word[-1],
        "length_of_name": len(word),
        "first_two": word[0:2].lower(),
        "last_two": word[-2:],
        "first_three": word[0:3].lower(),
        "last_three": word[-3:],
    }


names_model_test(gender_features_iter_8)

  ==> Training (20 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.372
             2          -0.40758        0.825
             3          -0.33314        0.862
             4          -0.29526        0.873
             5          -0.27172        0.881
             6          -0.25523        0.888
             7          -0.24276        0.892
             8          -0.23284        0.897
             9          -0.22466        0.900
            10          -0.21773        0.904
            11          -0.21174        0.906
            12          -0.20649        0.909
            13          -0.20183        0.910
            14          -0.19765        0.913
            15          -0.19387        0.914
            16          -0.19042        0.915
            17          -0.18727        0.916
            18          -0.18436        0.917
            19          -0.18167        0.917
  

The removal of these features improved the average model accuracy to 82.98% and the median model accuracy to 82.7%, so we removed them.

Previously, the double letters feature had not meaningfully affected the model's accuracy, but we wanted to test whether it interacted with the new features to further improve accuracy.

In [11]:
# Define the features for the ninth classifier and run the model
def gender_features_iter_9(word):
    return {
        "first_letter": word[0].lower(),
        "last_letter": word[-1],
        "length_of_name": len(word),
        "first_two": word[0:2].lower(),
        "last_two": word[-2:],
        "first_three": word[0:3].lower(),
        "last_three": word[-3:],
        "doubles": double_letter_check(word),
    }


names_model_test(gender_features_iter_9)

  ==> Training (20 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.372
             2          -0.43149        0.812
             3          -0.35224        0.859
             4          -0.31007        0.874
             5          -0.28372        0.881
             6          -0.26539        0.885
             7          -0.25167        0.891
             8          -0.24086        0.895
             9          -0.23202        0.899
            10          -0.22460        0.901
            11          -0.21823        0.904
            12          -0.21267        0.906
            13          -0.20775        0.910
            14          -0.20335        0.911
            15          -0.19938        0.911
            16          -0.19577        0.912
            17          -0.19247        0.913
            18          -0.18943        0.915
            19          -0.18662        0.916
  

This feature still does not improve the accuracy of the model. We finalized our features as:

1. The first letter of the name.
2. The last letter of the name.
3. The length of the name.
4. The first two letters of the name.
5. The last two letters of the name.
6. The first three letters of the name.
7. The last three letters of the name.

## TESTING PROCESS

Although we'd hypothesized that a maximum entropy classifier would perform better than a naive Bayes classifier, tried using a naive Bayes classifier as well just to confirm that the maximum entropy classifier was more accurate. 

In [12]:
# Run a naive Bayes classifier using the same process
bayes_results = []
for i in range(20):
    # Create devtest and training set split
    random.shuffle(trainset_names)
    devtest_names = trainset_names[:500]
    train_names = trainset_names[500:]
    devtest_featuresets = [(gender_features_iter_8(n), g) for (n, g) in devtest_names]
    train_featuresets = [(gender_features_iter_8(n), g) for (n, g) in train_names]

    # Train the classifier on the training set and print the accuracy
    bayes_model = nltk.NaiveBayesClassifier.train(train_featuresets)
    bayes_results.append(nltk.classify.accuracy(bayes_model, devtest_featuresets))

print("Average Naive Bayes Model Accuracy:", np.mean(bayes_results))
print("Median Naive Bayes Model Accuracy:", np.median(bayes_results))

Average Naive Bayes Model Accuracy: 0.8329000000000001
Median Naive Bayes Model Accuracy: 0.836


Despite our belief that a maximum entropy model would be more appropriate, the naive Bayes model has a higher mean and median accuracy over 20 attempts, so our final model is a naive Bayes model with the `gender_features_iter_8` feature set. We now use the model on the test set.

In [13]:
# Run the final model on the test set
test_featuresets = [(gender_features_iter_8(n), g) for (n, g) in test_names]
final_results = nltk.classify.accuracy(bayes_model, test_featuresets)
print("Final Model Accuracy on Test Set:", final_results)

Final Model Accuracy on Test Set: 0.834


The final model predicted 83.4% of the genders correctly (417 out of 500). We can analyze the 83 names that were classified incorrectly to get a sense of what future improvements could be made to the model.

In [16]:
# Create a list of prediction errors from the test set
errors_name = []
errors_guess = []
errors_actual = []
for name, tag in test_names:
    guess = bayes_model.classify(gender_features_iter_8(name))
    if guess != tag:
        errors_name.append(name)
        errors_guess.append(guess)
        errors_actual.append(tag)

errors_df = pd.DataFrame(
    {"name": errors_name, "guess": errors_guess, "actual": errors_actual}
)

pd.set_option("display.max_rows", None)
print(errors_df.to_string())

           name   guess  actual
0          Dido    male  female
1        Giorgi  female    male
2        Daniel    male  female
3         Rahel    male  female
4          Gert    male  female
5         Leiah    male  female
6          Pate  female    male
7       Mildred    male  female
8          Jess    male  female
9       Harriet    male  female
10       Sascha  female    male
11         Enid    male  female
12       Willie    male  female
13          Pat    male  female
14        Brice  female    male
15          Kit    male  female
16     Mercedes    male  female
17     Melicent    male  female
18     Giovanne  female    male
19      Rosario    male  female
20       Marten  female    male
21     Jermaine  female    male
22        Carey  female    male
23     Giuseppe  female    male
24       Meagan    male  female
25       Hilary    male  female
26       Heidie    male  female
27      Allyson    male  female
28     Ashleigh    male  female
29       Shayne  female    male
30  Cons

## CONCLUSION

Our model predicted 83.4% of the genders in the test set correctly. This was very close to its performance on the dev-test set (an average of 83.3% over 20 attempts). We took great pains to make sure that the test and dev-test sets were representative by:

1. Randomizing the names before selecting the test set;
2. Re-randomizing the data before selecting the dev-test set from the training data in each trial; and
3. Performing multiple trials and taking the average to estimate accuracy.

As a result, we expected that the model's performance on the test set would closely match its performance on the dev-test set, and we believe that our results speak for themselves.

Although there are some names where the model made an incorrect prediction that would be obviously wrong to a human observer (predicting that Meagan, Allyson, and Hilary were male, for example) it also made some incorrect predictions that were entirely reasonable (such as predicting that Austin, Randy, Bill, and Daniel were male). A more sophisticated model might use the maximum entropy classifier after all, and generate a probability distribution for each name rather than a simple prediction of which outcome is most likely.