### CUNY Data 620 - Web Analytics, Summer 2020  
**Group Project 3**   
**Prof:** Alain Ledon  
**Members:** Misha Kollontai, Amber Ferger, Zach Alexander, Subhalaxmi Rout  
  
**YouTube Link**: https://www.youtube.com/watch?v=a3oDaz4SHSY

### Instructions
Using any of the three classifiers described in chapter 6 of Natural Language Processing with Python,
and any features you can think of, build the best name gender classifier you can. 

Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the devtest set, and the remaining 6900 words for the training set. Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are satisfied with your classifier, check its final performance on the test set.


How does the performance on the test set compare to the performance on the dev-test set? Is this what
you'd expect? 

### Importing Packages

In [1]:
import nltk
from nltk.corpus import names
import random
import pandas as pd
import numpy as np
from itertools import groupby
import math
from itertools import repeat

### The Data

The *names* corpus in the nltk package contains the names and genders of 7,944 individuals. First, we will compile a list of all names with their gender. 

In [2]:
males = [(name, 'male') for name in names.words('male.txt')]
numMales = len(males)
females = [(name, 'female') for name in names.words('female.txt')]
numFemales = len(females)

print(f'There are {numMales} male names in the dataset.')
print(f'There are {numFemales} female names in the dataset.')

There are 2943 male names in the dataset.
There are 5001 female names in the dataset.


We can combine the lists and shuffle the data so that all names of the same gender are not together. We can confirm that the names are shuffled by looking at the genders of the first 5 individuals. 

In [3]:
random.seed(123)
allNames = males + females
random.shuffle(allNames)

print('First 5 names in the dataset:')
allNames[0:5]

First 5 names in the dataset:


[('Cordelie', 'female'),
 ('Peggie', 'female'),
 ('Solange', 'female'),
 ('Rana', 'female'),
 ('Jessy', 'female')]

### Original Features
Next, we'll define a function to create features for our names. The initial features will include:
* **last_letter**: The last letter of the given name.
* **first_letter**: The first letter of the given name. 
* **name_length**: The length of the given name.
* **num_vowels**: The number of vowels in the given name.
* **num_consonants**: The number of consonants in the given name. 

In [4]:
def gender_features(name):
    name = name.lower()
    features = {}
    features['last_letter'] = name[-1]
    features['first_letter'] = name[0]
    features['name_length'] = len(name)  
    vowels = ['a', 'e', 'i', 'o', 'u']
    vowelLength = len([i for i in name if i in vowels])
    features['num_vowels'] = vowelLength
    features['num_consonants'] = len(name) - vowelLength

    return features

### Train-Test-Split
Now that we've defined our feature function, we can run it on our dataset and split it into training, testing, and dev testing sets. 
* **Training Set**: This data will be used to train our classifiers and fit the models.
* **Dev Test Set**: This data will be used to predict the gender (male or female). It will provide an unbiased evaluation of a model fit on the training dataset. We can use the results of the development set to tune our model. 
* **Test Set**: This data will be used to compute the accuracy of the final model. Since the model has never seen this data, it will provide an unbiased evaluation of the clasifier.

The splits will be in the format of ({features}, gender). We will store the names and genders of the individuals in separate lists for each split.

In [5]:
def tts(featureFunc, nameList):
    featureSet = [(featureFunc(n),g) for (n,g) in nameList]
    test_set, devtest_set, train_set = featureSet[0:500], featureSet[500:1000], featureSet[1000:] 
    tsName = nameList[0:500]
    dtName = nameList[500:1000]
    tName = nameList[1000:]
    
    return test_set, devtest_set, train_set, tsName, dtName, tName

test_set, devtest_set, train_set, tsName, dtName, tName = tts(gender_features, allNames)

print('Num records - train set: ', len(train_set))
print('Num records - dev test set: ', len(devtest_set))
print('Num records - test set: ', len(test_set))

Num records - train set:  6944
Num records - dev test set:  500
Num records - test set:  500


### Original Classifier - Naive Bayes Classifier
Now that we've split our data into training, development, and test sets, we can create a **Naive Bayes Classifier** to predict the gender of the names. In this type of model, each feature gets a say in determining which label should be assigned to a given input value. The prior probability is calculated for each label (male, female), and the contribution from each feature is combined with this probability to arrive at a likelihood estimate for each label.

We will measure the accuracy of the model (the percentage of names the classifier predicts correctly) using the development test set.

In [6]:
nbClass = nltk.NaiveBayesClassifier.train(train_set)
print('Accuracy: ', nltk.classify.accuracy(nbClass, devtest_set))

Accuracy:  0.782


We can also take a look at the most important features used for predicting the gender. For each feature, this tells us the ratio of occurences for each gender.

In [7]:
nbClass.show_most_informative_features(15)

Most Informative Features
             last_letter = 'a'            female : male   =     33.3 : 1.0
             last_letter = 'k'              male : female =     29.2 : 1.0
             last_letter = 'p'              male : female =     18.6 : 1.0
             last_letter = 'f'              male : female =     15.2 : 1.0
             last_letter = 'v'              male : female =      9.8 : 1.0
             last_letter = 'd'              male : female =      9.8 : 1.0
             last_letter = 'm'              male : female =      9.2 : 1.0
             last_letter = 'o'              male : female =      8.0 : 1.0
             last_letter = 'w'              male : female =      8.0 : 1.0
             last_letter = 'r'              male : female =      6.7 : 1.0
            first_letter = 'w'              male : female =      4.6 : 1.0
              num_vowels = 5              female : male   =      4.5 : 1.0
             last_letter = 'b'              male : female =      4.4 : 1.0

We can see that the last letter and number of vowels in the names appear to be the driving factors. 

We can also generate a list of errors to see which names we've classified improperly. This will help us identify what additional features we should add to make the classification more accurate. 

In [8]:
def pred_calc(nameList, featureFunc, nbClass):
    preds = []
    errors = []
    for (name,actual) in nameList:
        guess = nbClass.classify(featureFunc(name))
        preds.append((actual,guess,name))
        if guess != actual:
            errors.append((actual, guess, name))
    
    return preds, errors

preds, errors = pred_calc(dtName, gender_features, nbClass)
print('Number of errors:', len(errors))

Number of errors: 109


When we sort the errors by the last two characters of the first name, we can see that some combinations occur more frequently in males than females and vice versa. For example, the letters *ie* appear more often in male names and then letters *ly* appear more often in female names. Let's update our feature set to take this into account.

In [10]:
sorted(errors, key=lambda x: x[-1][-2:])[:10]

[('female', 'male', 'Em'),
 ('female', 'male', 'Talyah'),
 ('female', 'male', 'Shirah'),
 ('male', 'female', 'Donal'),
 ('female', 'male', 'Sam'),
 ('male', 'female', 'Fabian'),
 ('female', 'male', 'Sean'),
 ('male', 'female', 'Coleman'),
 ('male', 'female', 'Christian'),
 ('male', 'female', 'Adrian')]

### Feature Set Revamp

Now that we have a baseline for comparison, let's add some features to our dataset. For each iteration with new features, we will recreate our train, test, and dev test splits and run the Naive Bayes Classifer on the data.

#### Model 2
Model 2 will include 3 additional features:
* **last_two_letters**: Last 2 letters of the name.
* **first_two_letters**: First 2 letters of the name. 
* **dbl_ltrs**: Presence of double letters (ex: *tt*) in a name. 

In [11]:
def gender_features2(name):
    name = name.lower()
    features = {}
    features['last_letter'] = name[-1]
    features['first_letter'] = name[0]
    features['name_length'] = len(name)    
    vowels = ['a', 'e', 'i', 'o', 'u']
    vowelLength = len([i for i in name if i in vowels])
    features['num_vowels'] = vowelLength
    features['num_consonants'] = len(name) - vowelLength
    
    # add in feature for last 2 letters of name
    features['last_two_letters'] = name[-2:]
    
    # add in feature for first 2 letters of name
    features['first_two_letters'] = name[:2]
    
    # presence of double letters:
    def find_dbl_ltrs(x):
        groups = groupby(name)
        result = [(label, sum(1 for _ in group)) for label, group in groups]
        return (len([x[1] for x in result if x[1]>1]))
    features['dbl_ltrs'] = find_dbl_ltrs(name)

    return features

test_set, devtest_set, train_set, tsName, dtName, tName = tts(gender_features2, allNames)
nbClass2 = nltk.NaiveBayesClassifier.train(train_set)
print('Accuracy: ', nltk.classify.accuracy(nbClass2, devtest_set))

preds2, errors2 = pred_calc(dtName, gender_features2, nbClass2)
print('Number of errors:', len(errors2))

Accuracy:  0.82
Number of errors: 90


Our accuracy went up to **82%**! Let's try again with some additional features.

#### Model 3
Our third model will include the addition of **Bouba and Kiki Vowels/Consonants**. Sidhu and Pexman (1) discovered a relationship of Bouba with female first names and Kiki with male first names. We will use a modified version of their findings and define the following new features: 
* **num_bouba_cons**: Count of the letters *b*, *l*, *m*, and *n*. *(Female names tend to have more of these)*
* **num_bouba_vowels**: Count of the letters *u* and *o*. *(Female names tend to have more of these)*
* **num_kiki_cons**: Count of the letters *k*, *p*, and *t*. *(Male names tend to have more of these)*
* **num_kiki_vowels**: Count of the letters *i* and *e*. *(Male names tend to have more of these)*

In [14]:
# https://arxiv.org/pdf/1606.05467.pdf

def gender_features3(name):
    name = name.lower()
    features = {}
    features['last_letter'] = name[-1]
    features['first_letter'] = name[0]
    features['name_length'] = len(name)    
    vowels = ['a', 'e', 'i', 'o', 'u']
    vowelLength = len([i for i in name if i in vowels])
    features['num_vowels'] = vowelLength
    features['num_consonants'] = len(name) - vowelLength
    
    # add in feature for last 2 letters of name
    features['last_two_letters'] = name[-2:]
    
    # add in feature for first 2 letters of name
    features['first_two_letters'] = name[:2]
    
    # presence of double letters:
    def find_dbl_ltrs(x):
        groups = groupby(name)
        result = [(label, sum(1 for _ in group)) for label, group in groups]
        return (len([x[1] for x in result if x[1]>1]))
    features['dbl_ltrs'] = find_dbl_ltrs(name)
    
    # add in bouba & kiki counts
    boubaCons = ['b', 'l', 'm', 'n']
    boubaVowels = ['u', 'o']
    kikiCons = ['k', 'p', 't']
    kikiVowels = ['i', 'e']
    
    bcLength = len([i for i in name if i in boubaCons])
    bvLength = len([i for i in name if i in boubaVowels])
    kcLength = len([i for i in name if i in kikiCons])
    kvLength = len([i for i in name if i in kikiVowels])

    features['num_bouba_cons'] = bcLength
    features['num_bouba_vowels'] = bvLength
    features['num_kiki_cons'] = kcLength
    features['num_kiki_vowels'] = kvLength

    return features

test_set, devtest_set, train_set, tsName, dtName, tName = tts(gender_features3, allNames)
nbClass3 = nltk.NaiveBayesClassifier.train(train_set)
print('Accuracy: ', nltk.classify.accuracy(nbClass3, devtest_set))

preds3, errors3 = pred_calc(dtName, gender_features3, nbClass3)
print('Number of errors:', len(errors3))

Accuracy:  0.81
Number of errors: 95


Interestingly, the addition of these features actually *decreased* the accuracy of predictions on the development set. This could be because our training data is **overfit** to the features, which brings up an important point - more features does not always mean a better model!

### Evaluation
#### Overall Accuracy
We can now evaluate the models on our **test set**. First, we'll look at the overall accuracy of each of our subsequent models. 

In [22]:
pd.DataFrame([['First', nltk.classify.accuracy(nbClass, devtest_set), nltk.classify.accuracy(nbClass, test_set)], 
             ['Second', nltk.classify.accuracy(nbClass2, devtest_set), nltk.classify.accuracy(nbClass2, test_set)], 
             ['Third', nltk.classify.accuracy(nbClass3, devtest_set), nltk.classify.accuracy(nbClass3, test_set)]],
            columns = ['MODEL', 'DEV_ACCURACY', 'TEST_ACCURACY'])

Unnamed: 0,MODEL,DEV_ACCURACY,TEST_ACCURACY
0,First,0.782,0.772
1,Second,0.82,0.8
2,Third,0.81,0.802


We can see that the accuracy on the development set increases from the first model to the second model, and then decreases from the second model to the third model. However, it is interesting to note that the accuracy on the test set actually increases from the first model to the third! 

When looking at each model, we also notice that the accuracy on the test set is lower than on the development set. This is expected, as we tweaked our feature set based on the results of the development set and the test set contains data that the model has never seen before.

**Based on these results, we will use model 3 as our final model.**

#### Gender Specific Accuracies
Now that we've looked at the overall accuracy, let's take a look at the male and female specific accuracies. We'll create a function that includes a break down the results of our final model. 

In [26]:
# gender-specific accuracy function
def summ_table(allNames, tsPred):
    tag = [name for name in tsPred if [name for (name, tag) in allNames]]
    perform = []
    for i in tsPred:
        if (i[0] == 'male') & (i[1] == 'male'):
            perform.append('correct male')
        elif (i[0] == 'female') & (i[1] == 'female'):
            perform.append('correct female')
        elif (i[0] == 'male') & (i[1] == 'female'):
            perform.append('incorrect male')
        else:
            perform.append('incorrect female')
    correct_male = perform.count('correct male')
    correct_female = perform.count('correct female')
    incorrect_female = perform.count('incorrect female')
    incorrect_male = perform.count('incorrect male')
    
    performance_table_pct = pd.DataFrame([['Females', "{:.0%}".format(correct_female / (correct_female + incorrect_female)), "{:.0%}".format(incorrect_female / (correct_female + incorrect_female))],
             ['Males', "{:.0%}".format(correct_male / (correct_male + incorrect_male)), "{:.0%}".format(incorrect_male / (correct_male + incorrect_male))]],
            columns = ['Gender', 'Percent Correct', 'Percent Incorrect'])
    performance_table_pct.style.hide_index()
    
    return performance_table_pct

In [27]:
tsPred, tsErrors = pred_calc(tsName, gender_features3, nbClass3)
performance_table_pct = summ_table(allNames, tsPred)
performance_table_pct

Unnamed: 0,Gender,Percent Correct,Percent Incorrect
0,Females,83%,17%
1,Males,76%,24%


We can see that in our final model, the females are predicted with a higher accuracy than the males! This is likely because the dataset is skewed in favor of female names (63% female / 37% male). In order to see how much the greater accuracy for female names was driven by the imbalance within the dataset, we will balance the set using two different approaches: **Undersampling** and **Oversampling**. We will re-evaluate our model after adjusting the training data - once by removing the extra female names (undersampling) and once by copying in repeats of male names to balance out the number of female names (oversampling). 

#### Balancing the Training Set

We will define a function that takes the training set, the names associated with the training set, and a 0 or 1 depending on whether an undersampling or an oversampling is preferred. Once our balanced data is created, we will re-create the model and look at the overall and gender-specific accuracies.

In [28]:
#'over' input below signifies whether the user wants to undersample or oversample the dataset (default undersample)
def balance_train(train_set, tName, under = 1):
    gender = []
    for name,g in tName:
        if g == "female":
            gender.append(1)
        else:
            gender.append(0)
    n_female = sum(gender)
    n_male = len(gender) - n_female
    if n_female == n_male:
        return train_set, tName
    elif n_female > n_male:
        more = "F"
        delta = n_female - n_male
    else:
        more = "M"
        delta = n_male - n_female
    
    idx_males = []
    idx_males = [i for i, val in enumerate(tName) if val[1] == "male"]
    idx_females = []
    idx_females = [i for i, val in enumerate(tName) if val[1] == "female"]
    
    remove = []
    copy = []
    if more == "F":
        remove = idx_females
        remove = remove[-delta:]
        copy = idx_males
    elif more == "M":
        remove = idx_males
        remove = remove[-delta:]
        copy = idx_females
    
    if under == 1:
        for index in reversed(remove):
            del tName[index]
            del train_set[index]
    elif under == 0:
        for i in range(0,delta):
            tName.append(tName[copy[i]])
            train_set.append(train_set[copy[i]])
    return train_set, tName

#### Effect of undersampling the female set within the training data

In [34]:
test_set, devtest_set, train_set, tsName, dtName, tName = tts(gender_features3, allNames)
train_set, tName = balance_train(train_set, tName,1)

nbClass4 = nltk.NaiveBayesClassifier.train(train_set)
print('Accuracy - Downsampling: ', nltk.classify.accuracy(nbClass4, devtest_set))

preds4, errors4 = pred_calc(dtName, gender_features3, nbClass4)
print('Number of errors - Downsampling:', len(errors4))

tsPred4, tsError4 = pred_calc(tsName, gender_features3,nbClass4)
performance_table_pct = summ_table(allNames, tsPred4)
performance_table_pct

Accuracy - Downsampling:  0.796
Number of errors - Downsampling: 102


Unnamed: 0,Gender,Percent Correct,Percent Incorrect
0,Females,78%,22%
1,Males,81%,19%


These results are certainly interesting - by **undersampling** the female set, we see that the overall accuracy remains about the same (~80%), but the gender-specific accuracies change! We now see that the males are being predicted at a higher rate than the females!

#### Effect of oversampling the male set within the training data

In [35]:
test_set, devtest_set, train_set, tsName, dtName, tName = tts(gender_features3, allNames)
train_set, tName = balance_train(train_set, tName, 0)

nbClass5 = nltk.NaiveBayesClassifier.train(train_set)
print('Accuracy - Oversampling: ', nltk.classify.accuracy(nbClass5, devtest_set))

preds5, errors5 = pred_calc(dtName, gender_features3, nbClass5)
print('Number of errors - Oversampling:', len(errors5))

tsPred5, tsError5 = pred_calc(tsName, gender_features3,nbClass5)
summ_table(allNames, tsPred5)

Accuracy - Oversampling:  0.798
Number of errors - Oversampling: 101


Unnamed: 0,Gender,Percent Correct,Percent Incorrect
0,Females,78%,22%
1,Males,82%,18%


Once again, we see that the overall accuracy remains about the same. The gender-specific accuracies are more balanced! Ultimately, since we'd like to predict male and female names at a similar accuracy, undersampling and oversampling help to create a more balanced model.

#### Impact of original random_seed to split the data

The names that make their way into the training set will obviously have an impact on how accurate a predictor model is. Below are a few result tables showing the difference in accuracy based on different initial train-test splits.

In [36]:
def pull_correct(allNames, tsPred):
    tag = [name for name in tsPred if [name for (name, tag) in allNames]]
    perform = []
    for i in tsPred:
        if (i[0] == 'male') & (i[1] == 'male'):
            perform.append('correct male')
        elif (i[0] == 'female') & (i[1] == 'female'):
            perform.append('correct female')
        elif (i[0] == 'male') & (i[1] == 'female'):
            perform.append('incorrect male')
        else:
            perform.append('incorrect female')
            
    correct_male = perform.count('correct male')
    correct_female = perform.count('correct female')
    incorrect_female = perform.count('incorrect female')
    incorrect_male = perform.count('incorrect male')
    
    female_pct_corr = correct_female / (correct_female + incorrect_female)
    male_pct_corr = correct_male / (correct_male + incorrect_male)
    
    return female_pct_corr , male_pct_corr

In [37]:
n_seeds = 5
seeds = random.sample(range(0,1000),n_seeds)
iterables = [seeds,['F','M']]
index = pd.MultiIndex.from_product(iterables, names = ['Seed','Gender'])
df = pd.DataFrame(np.zeros((n_seeds*2, 3)),index =index, columns = ["Normal","Undersampled","Oversampled"])
counter = 0
for seed in seeds:
    random.seed(seed)
    allNames = males + females
    random.shuffle(allNames)
    test_set, devtest_set, train_set, tsName, dtName, tName = tts(gender_features3, allNames)
    for i in range(0,3):
        train_set2 = []
        tname = []
        if i == 0:
            train_set2 = train_set
        elif i == 1:
            train_set2, tName = balance_train(train_set, tName, 1)
        elif i == 2:
            train_set2, tName = balance_train(train_set, tName, 0)
            
        nbClass = nltk.NaiveBayesClassifier.train(train_set2)
        tsPred, tsError = pred_calc(tsName, gender_features3,nbClass)
        f_pct, m_pct = pull_correct(allNames, tsPred)
        df.iloc[counter][i] = f_pct
        df.iloc[counter+1][i] = m_pct
    allNames =[]
    counter = counter + 2
    
df.style.format({
    'Normal': '{:,.1%}'.format,
    'Undersampled': '{:,.1%}'.format,
    'Oversampled': '{:,.1%}'.format,
})

Unnamed: 0_level_0,Unnamed: 1_level_0,Normal,Undersampled,Oversampled
Seed,Gender,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
865,F,81.9%,76.8%,76.8%
865,M,80.0%,84.9%,84.9%
888,F,80.1%,75.5%,75.5%
888,M,78.6%,81.5%,81.5%
616,F,79.6%,75.2%,75.2%
616,M,78.5%,80.1%,80.1%
171,F,82.9%,80.3%,80.3%
171,M,74.2%,78.4%,78.4%
163,F,83.3%,80.9%,80.9%
163,M,80.7%,83.0%,83.0%


As we can see in the table above our female name accuracy changes significantly even just for the Normal training set (ranging anywhere from below 80% to above 85%) depending on the random seed chosen to split the data into train and test datasets. 

### Maximum Entropy Classification

Our last step in this project is to create a Maximum Entropy Classifier for the data.

The Maximum Entropy classifier model is a generalization of the model used by the Naive Bayes classifier. Like the Naive Bayes model, the Maximum Entropy classifier calculates the likelihood of each label for a given input value by multiplying together the parameters that are applicable for the input value and label.

Let's calculate the entropy of the labels for our dataset. Higher entropy implies better classification algorithm.

In [17]:
# make a list of male and female
all_male_female = list(repeat('male', len(males))) + list(repeat('female', len(females)))
def entropy(labels):    
    freq_dist = nltk.FreqDist(labels)    
    probs = [freq_dist.freq(i) for i in nltk.FreqDist(labels)]    
    return -sum([j * math.log(j,2) for j in probs])

print (entropy(all_male_female))    

0.951030970454714


Above function shows that entropy is at 95%. <br> Let's create maximum entropy classifier model based on the features using training, dev test, and test sets. We will apply the model with 3 different feature sets i.e gender_features, gender_features2, and gender_features3.

In [18]:
test_set_1, devtest_set_1, train_set_1, tsName_1, dtName_1, tName_1 = tts(gender_features, allNames)
test_set_2, devtest_set_2, train_set_2, tsName_2, dtName_2, tName_2 = tts(gender_features2, allNames)
test_set_3, devtest_set_3, train_set_3, tsName_3, dtName_3, tName_3 = tts(gender_features3, allNames)

In [43]:
%%capture
classifier_1 = nltk.classify.MaxentClassifier.train(train_set_1)
preds_1, errors_1 = pred_calc(dtName_1, gender_features, classifier_1)

In [44]:
%%capture
classifier_2 = nltk.classify.MaxentClassifier.train(train_set_2)
preds_2, errors_2 = pred_calc(dtName_2, gender_features2, classifier_2)

In [45]:
%%capture
classifier_3 = nltk.classify.MaxentClassifier.train(train_set_3)
preds_3, errors_3 = pred_calc(dtName_3, gender_features3, classifier_3)

Lets put all 3 features in a tablular format and see the accuracy of devtest_set and test_set.  

In [42]:
pd.DataFrame([['First', nltk.classify.accuracy(classifier_1, devtest_set_1), nltk.classify.accuracy(classifier_1, test_set_1)],
             ['Second', nltk.classify.accuracy(classifier_2, devtest_set_2), nltk.classify.accuracy(classifier_2, test_set_2)], 
             ['Final', nltk.classify.accuracy(classifier_3, devtest_set_3), nltk.classify.accuracy(classifier_3, test_set_3)]],
            columns = ['MODEL', 'DEV_ACCURACY', 'TEST_ACCURACY'])

Unnamed: 0,MODEL,DEV_ACCURACY,TEST_ACCURACY
0,First,0.806,0.772
1,Second,0.798,0.828
2,Final,0.812,0.83


We can see that our Max Entropy classifier uses an iterative method to maximize the performance of the training corpus classification. In this case the default number of iteration was 100. Due to this  it takes a long time to train a huge dataset and could also explain why it is not as popular. 

### Discussion



In the end, we created a Naive Bayes Classifier to predict the gender of a given name from our Names corpus. With a robust corpus of 7944 names, we randomized and split the data into a test set of 500 names (for final testing of our model), a development test set of 500 names to utilize while tweaking and adjusting our model features, and a training set that we used throughout to train our model -- which contained 6944 names.

##### Identifying the most informative features and creating our classifier

After several runs of our Naive Bayes Classifier, we were able to pinpoint some of the most informative features:

+ the value of the first and last few letters of a given name
+ the number of vowels present in a given name
+ the presence of double letters (i.e. "ee", "oo", etc.) in a given name
+ the length of a given name
+ the number of consonants in a given name
+ and the presence of certain letters in a given name (based on research, some letters seemed to be present more frequently in male or female names)

With our features identified, we were able to use our final classifier on our test set of 500 names. We found that our classifier, and these features were able to successfully predict the gender of about 80% of the names in our test corpus. 

##### Evaluating the performance of our classifier and resampling our training dataset

Although this was interesting, we did realize that there was an unequal distribution of male and female names present in our Names corpus, and thus found that the prediction accuracy was slightly higher for females than for males in our final evaluation. We can attribute this to the fact that the model had more female names to train on, which ultimately led to a slightly better performance when we subjected it to our final test set.

To investigate this further, we decided to retrain our model in two ways: 

1) We **undersampled the female names** in our training set in order to create a more equal proportion of female/male names for our classifier to determine patterns from -- to do this, we randomly removed about 2000 female names from our training data

2) We **oversampled the male names** in our training set in order to match the number of female names -- to do this, we randomly copied about 2000 male names to match the number of female names present in the training data


##### Summary of our findings

After training our data with these new splits, we then subjected this new classifier on our test data and found that the overall performance decreased slightly (by about 1%). As we can see from the summary tables, both approaches to balancing the training dataset reduced the accuracy of predicting female names, but increased that of predicting male names. This was to be expected since the resampled training data and the corresponding algorithm had fewer female names to determine patterns from, leading to less accurate predictions. However, this in turn increased the relative impact of the male-predictor patterns.

Overall, we were able to implement a Naive Bayes Classifier and Maximum Entropy Classifier on our names corpus, and after conducting a few different sampling techniques, we generated a classifier that preformed quite well (~80% accuracy predicting female and male names).

### Resources

1. D. M. Sidhu and P. M. Pexman. What’s in a name? sound symbolism and gender in first names. PLOS ONE, 10(5):e0126809, 2015.
