### CUNY Data 620 - Web Analytics, Summer 2020  
**Group Project 3**   
**Prof:** Alain Ledon  
**Members:** Misha Kollontai, Amber Ferger, Zach Alexander, Subhalaxmi Rout  
  
**YouTube Link**: 

### Instructions
Using any of the three classifiers described in chapter 6 of Natural Language Processing with Python,
and any features you can think of, build the best name gender classifier you can. 

Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the devtest set, and the remaining 6900 words for the training set. Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are satisfied with your classifier, check its final performance on the test set.


How does the performance on the test set compare to the performance on the dev-test set? Is this what
you'd expect? 

### Importing Packages

In [1]:
import nltk
from nltk.corpus import names
import random
import pandas as pd
import numpy as np
from itertools import groupby

### The Data

The *names* corpus in the nltk package contains the names and genders of 7,944 individuals. First, we will compile a list of all names with their gender. 

In [2]:
males = [(name, 'male') for name in names.words('male.txt')]
numMales = len(males)
females = [(name, 'female') for name in names.words('female.txt')]
numFemales = len(females)

print(f'There are {numMales} male names in the dataset.')
print(f'There are {numFemales} female names in the dataset.')

There are 2943 male names in the dataset.
There are 5001 female names in the dataset.


We can combine the lists and shuffle the data so that all names of the same gender are not together. We can confirm that the names are shuffled by looking at the genders of the first 5 individuals. 

In [3]:
random.seed(123)
allNames = males + females
random.shuffle(allNames)

print('First 5 names in the dataset:')
allNames[0:5]

First 5 names in the dataset:


[('Cordelie', 'female'),
 ('Peggie', 'female'),
 ('Solange', 'female'),
 ('Rana', 'female'),
 ('Jessy', 'female')]

### The Features
Next, we'll define a function to create features for our names. The initial features will include:
* **last_letter**: The last letter of the given name.
* **first_letter**: The first letter of the given name. 
* **name_length**: The length of the given name.
* **num_vowels**: The number of vowels in the given name.
* **num_consonants**: The number of consonants in the given name. 

In [4]:
def gender_features(name):
    name = name.lower()
    features = {}
    features['last_letter'] = name[-1]
    features['first_letter'] = name[0]
    features['name_length'] = len(name)  
    vowels = ['a', 'e', 'i', 'o', 'u']
    vowelLength = len([i for i in name if i in vowels])
    features['num_vowels'] = vowelLength
    features['num_consonants'] = len(name) - vowelLength

    return features

### Train-Test-Split
Now that we've defined our feature function, we can run it on our dataset and split it into training, testing, and dev testing sets. 
* **Training Set**: This data will be used to train our classifiers and fit the models.
* **Dev Test Set**: This data will be used to check each model's predictions and to perform error analysis. Because the model is not trained on it, it gives a fair estimate of accuracy while we tune our features, although that estimate becomes more optimistic the more we tune against it. 
* **Test Set**: This data will be used to compute the accuracy of the final model. Since neither the model nor our feature choices have seen this data, it will provide an unbiased evaluation of the classifier.

The splits will be in the format of ({features}, gender). We will store the names and genders of the individuals in separate lists for each split.

In [5]:
def tts(featureFunc, nameList):
    featureSet = [(featureFunc(n),g) for (n,g) in nameList]
    test_set, devtest_set, train_set = featureSet[0:500], featureSet[500:1000], featureSet[1000:] 
    tsName = nameList[0:500]
    dtName = nameList[500:1000]
    tName = nameList[1000:]
    
    return test_set, devtest_set, train_set, tsName, dtName, tName

test_set, devtest_set, train_set, tsName, dtName, tName = tts(gender_features, allNames)

print('Num records - train set: ', len(train_set))
print('Num records - dev test set: ', len(devtest_set))
print('Num records - test set: ', len(test_set))

Num records - train set:  6944
Num records - dev test set:  500
Num records - test set:  500


### Original Classifier - Naive Bayes Classifier
Now that we've split our data into training, development, and test sets, we can create a **Naive Bayes Classifier** to predict the gender of the names. In this type of model, each feature gets a say in determining which label should be assigned to a given input value. The prior probability is calculated for each label (male, female), and the contribution from each feature is combined with this probability to arrive at a likelihood estimate for each label.
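To make the description above concrete, here is a minimal sketch of how a Naive Bayes posterior is assembled from a prior and per-feature likelihoods. The function name and all probabilities are hypothetical values chosen for illustration, and the sketch omits the smoothing that NLTK applies, so it is not a reproduction of the library's implementation.

```python
import math

def naive_bayes_posteriors(priors, likelihoods, features):
    """Return P(label | features) under the naive independence assumption.

    priors:      {label: P(label)}
    likelihoods: {label: {(feature_name, value): P(value | label)}}
    features:    {feature_name: value}
    """
    log_scores = {}
    for label, prior in priors.items():
        # Start from the prior, then let every feature add its contribution.
        score = math.log(prior)
        for item in features.items():
            score += math.log(likelihoods[label][item])
        log_scores[label] = score
    # Normalize so the scores sum to 1.
    total = sum(math.exp(s) for s in log_scores.values())
    return {label: math.exp(s) / total for label, s in log_scores.items()}

# Hypothetical numbers, chosen only for illustration.
priors = {'female': 0.63, 'male': 0.37}
likelihoods = {
    'female': {('last_letter', 'a'): 0.35, ('first_letter', 'w'): 0.01},
    'male':   {('last_letter', 'a'): 0.01, ('first_letter', 'w'): 0.05},
}
posteriors = naive_bayes_posteriors(
    priors, likelihoods, {'last_letter': 'a', 'first_letter': 'w'})
print(posteriors)
```

In this toy case the strong evidence from the last letter outweighs the weaker evidence from the first letter, so the female label receives the higher posterior.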

We will measure the accuracy of the model (the percentage of names the classifier predicts correctly) using the development test set.

In [6]:
nbClass = nltk.NaiveBayesClassifier.train(train_set)
print('Accuracy: ', nltk.classify.accuracy(nbClass, devtest_set))

Accuracy:  0.782


We can also take a look at the most informative features for predicting gender. For each feature value, this shows the ratio of its likelihood under one gender to its likelihood under the other.

In [7]:
nbClass.show_most_informative_features(15)

Most Informative Features
             last_letter = 'a'            female : male   =     33.3 : 1.0
             last_letter = 'k'              male : female =     29.2 : 1.0
             last_letter = 'p'              male : female =     18.6 : 1.0
             last_letter = 'f'              male : female =     15.2 : 1.0
             last_letter = 'v'              male : female =      9.8 : 1.0
             last_letter = 'd'              male : female =      9.8 : 1.0
             last_letter = 'm'              male : female =      9.2 : 1.0
             last_letter = 'o'              male : female =      8.0 : 1.0
             last_letter = 'w'              male : female =      8.0 : 1.0
             last_letter = 'r'              male : female =      6.7 : 1.0
            first_letter = 'w'              male : female =      4.6 : 1.0
              num_vowels = 5              female : male   =      4.5 : 1.0
             last_letter = 'b'              male : female =      4.4 : 1.0

We can see that the last letter of the name dominates the list, with the first letter and the number of vowels each appearing only once. 
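The ratios in that listing can be approximated by hand. The sketch below computes the ratio of how often a feature value occurs among female names versus male names from raw counts. The helper name and the eight toy names are our own assumptions, and unlike NLTK this version applies no smoothing, so it fails with a division by zero if the value never occurs for one gender.

```python
from collections import Counter

def likelihood_ratio(labeled_names, feature, value):
    """Ratio P(feature == value | female) / P(feature == value | male) from raw counts."""
    label_totals = Counter(label for _, label in labeled_names)
    hits = Counter(label for name, label in labeled_names if feature(name) == value)
    p_female = hits['female'] / label_totals['female']
    p_male = hits['male'] / label_totals['male']
    return p_female / p_male

# Made-up toy data: 3 of 4 female names and 1 of 4 male names end in 'a'.
toy = [('Anna', 'female'), ('Maria', 'female'), ('Ella', 'female'), ('Beth', 'female'),
       ('Joshua', 'male'), ('Mark', 'male'), ('Frank', 'male'), ('Peter', 'male')]
ratio = likelihood_ratio(toy, lambda n: n[-1].lower(), 'a')
print(f'female : male = {ratio:.1f} : 1.0')
```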

We can also generate a list of errors to see which names we've classified improperly. This will help us identify what additional features we should add to make the classification more accurate. 

In [8]:
def pred_calc(nameList, featureFunc, nbClass):
    preds = []
    errors = []
    for (name,actual) in nameList:
        guess = nbClass.classify(featureFunc(name))
        preds.append((actual,guess,name))
        if guess != actual:
            errors.append((actual, guess, name))
    
    return preds, errors

preds, errors = pred_calc(dtName, gender_features, nbClass)
print('Number of errors:', len(errors))

Number of errors: 109


When we sort the errors by the last two characters of the name, we can see that some combinations occur more frequently in males than females and vice versa. For example, the letters *ie* appear more often in male names and the letters *ly* appear more often in female names. Let's update our feature set to take this into account.

In [9]:
sorted(errors, key=lambda x: x[-1][-2:])

[('female', 'male', 'Em'),
 ('female', 'male', 'Talyah'),
 ('female', 'male', 'Shirah'),
 ('male', 'female', 'Donal'),
 ('female', 'male', 'Sam'),
 ('male', 'female', 'Fabian'),
 ('female', 'male', 'Sean'),
 ('male', 'female', 'Coleman'),
 ('male', 'female', 'Christian'),
 ('male', 'female', 'Adrian'),
 ('male', 'female', 'Vaughan'),
 ('female', 'male', 'Meggan'),
 ('female', 'male', 'Gay'),
 ('male', 'female', 'Murray'),
 ('male', 'female', 'Lawrence'),
 ('male', 'female', 'Bruce'),
 ('male', 'female', 'Lawerence'),
 ('male', 'female', 'Erich'),
 ('female', 'male', 'Dulcy'),
 ('male', 'female', 'Randi'),
 ('male', 'female', 'Lindy'),
 ('female', 'male', 'Freddy'),
 ('male', 'female', 'Jessee'),
 ('male', 'female', 'Mikel'),
 ('male', 'female', 'Nathaniel'),
 ('female', 'male', 'Pen'),
 ('female', 'male', 'Gwen'),
 ('female', 'male', 'Grier'),
 ('female', 'male', 'Delores'),
 ('female', 'male', 'Dew'),
 ('female', 'male', 'Sukey'),
 ('male', 'female', 'Carey'),
 ('female', 'male', 'Sop
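Reading a long sorted list by eye is error-prone, so one option is to tally the errors by suffix. The sketch below does this for a handful of errors copied from the listing above; the helper name errors_by_suffix is our own and is not part of the original notebook.

```python
from collections import Counter, defaultdict

# A handful of (actual, guess, name) errors copied from the listing above.
sample_errors = [
    ('male', 'female', 'Fabian'), ('male', 'female', 'Christian'),
    ('male', 'female', 'Adrian'), ('male', 'female', 'Vaughan'),
    ('female', 'male', 'Meggan'), ('female', 'male', 'Sean'),
    ('female', 'male', 'Pen'), ('female', 'male', 'Gwen'),
]

def errors_by_suffix(errors, n=2):
    """Map each n-letter suffix to a Counter of the actual labels that were missed."""
    table = defaultdict(Counter)
    for actual, _guess, name in errors:
        table[name[-n:].lower()][actual] += 1
    return table

suffix_table = errors_by_suffix(sample_errors)
for suffix, counts in sorted(suffix_table.items()):
    print(suffix, dict(counts))
```

A suffix whose errors fall mostly on one gender is a candidate for a dedicated feature.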

### Feature Set Revamp

**Last two letters**: First, let's add in a feature for the last 2 letters of each name, along with features for the first 2 letters and the number of double letters. We'll recreate our train, test, and dev test splits and run the Naive Bayes Classifier on the data.

In [10]:
def gender_features2(name):
    name = name.lower()
    features = {}
    features['last_letter'] = name[-1]
    features['first_letter'] = name[0]
    features['name_length'] = len(name)    
    vowels = ['a', 'e', 'i', 'o', 'u']
    vowelLength = len([i for i in name if i in vowels])
    features['num_vowels'] = vowelLength
    features['num_consonants'] = len(name) - vowelLength
    
    # add in feature for last 2 letters of name
    features['last_two_letters'] = name[-2:]
    
    # add in feature for first 2 letters of name
    features['first_two_letters'] = name[:2]
    
    # number of runs of repeated letters (e.g. 'nn' in 'anna'):
    def find_dbl_ltrs(x):
        groups = groupby(x)
        result = [(label, sum(1 for _ in group)) for label, group in groups]
        return len([run for run in result if run[1] > 1])
    features['dbl_ltrs'] = find_dbl_ltrs(name)

    return features

test_set, devtest_set, train_set, tsName, dtName, tName = tts(gender_features2, allNames)
nbClass2 = nltk.NaiveBayesClassifier.train(train_set)
print('Accuracy: ', nltk.classify.accuracy(nbClass2, devtest_set))

preds2, errors2 = pred_calc(dtName, gender_features2, nbClass2)
print('Number of errors:', len(errors2))

Accuracy:  0.82
Number of errors: 90


Our accuracy went up to 82%! Let's try again with some additional features.
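The dbl_ltrs feature used above relies on itertools.groupby to split a name into runs of identical consecutive letters. A standalone sketch of the same idea follows; the function name count_repeated_runs and the example names are ours.

```python
from itertools import groupby

def count_repeated_runs(name):
    """Count runs of two or more identical consecutive letters in a name."""
    runs = [(letter, sum(1 for _ in group)) for letter, group in groupby(name.lower())]
    return sum(1 for _letter, length in runs if length > 1)

for example in ['Annette', 'Aaron', 'Bob']:
    print(example, count_repeated_runs(example))
```

Note that groupby only groups adjacent items, so the two separate b's in "Bob" do not count as a double letter.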

**Bouba and Kiki Vowels/Consonants**: Sidhu and Pexman (1) found that round-sounding ("Bouba") sounds are associated with female first names and sharp-sounding ("Kiki") sounds with male first names. We will use a modified version of their findings and define the following new features: 
* **num_bouba_cons**: Count of the letters *b*, *l*, *m*, and *n*. *(Female names tend to have more of these)*
* **num_bouba_vowels**: Count of the letters *u* and *o*. *(Female names tend to have more of these)*
* **num_kiki_cons**: Count of the letters *k*, *p*, and *t*. *(Male names tend to have more of these)*
* **num_kiki_vowels**: Count of the letters *i* and *e*. *(Male names tend to have more of these)*

In [11]:
# https://arxiv.org/pdf/1606.05467.pdf

def gender_features3(name):
    name = name.lower()
    features = {}
    features['last_letter'] = name[-1]
    features['first_letter'] = name[0]
    features['name_length'] = len(name)    
    vowels = ['a', 'e', 'i', 'o', 'u']
    vowelLength = len([i for i in name if i in vowels])
    features['num_vowels'] = vowelLength
    features['num_consonants'] = len(name) - vowelLength
    
    # add in feature for last 2 letters of name
    features['last_two_letters'] = name[-2:]
    
    # add in feature for first 2 letters of name
    features['first_two_letters'] = name[:2]
    
    # number of runs of repeated letters (e.g. 'nn' in 'anna'):
    def find_dbl_ltrs(x):
        groups = groupby(x)
        result = [(label, sum(1 for _ in group)) for label, group in groups]
        return len([run for run in result if run[1] > 1])
    features['dbl_ltrs'] = find_dbl_ltrs(name)
    
    # add in bouba & kiki counts
    boubaCons = ['b', 'l', 'm', 'n']
    boubaVowels = ['u', 'o']
    kikiCons = ['k', 'p', 't']
    kikiVowels = ['i', 'e']
    
    bcLength = len([i for i in name if i in boubaCons])
    bvLength = len([i for i in name if i in boubaVowels])
    kcLength = len([i for i in name if i in kikiCons])
    kvLength = len([i for i in name if i in kikiVowels])

    features['num_bouba_cons'] = bcLength
    features['num_bouba_vowels'] = bvLength
    features['num_kiki_cons'] = kcLength
    features['num_kiki_vowels'] = kvLength

    return features

test_set, devtest_set, train_set, tsName, dtName, tName = tts(gender_features3, allNames)
nbClass3 = nltk.NaiveBayesClassifier.train(train_set)
print('Accuracy: ', nltk.classify.accuracy(nbClass3, devtest_set))

Accuracy:  0.81


In [12]:
preds3, errors3 = pred_calc(dtName, gender_features3,nbClass3)
print('Number of errors:', len(errors3))

Number of errors: 95


### Evaluation
We can now evaluate the final model on our test set. First, we'll look at the overall accuracy of each of our subsequent models. 

In [13]:
pd.DataFrame([['First', nltk.classify.accuracy(nbClass, devtest_set), nltk.classify.accuracy(nbClass, test_set)], 
             ['Second', nltk.classify.accuracy(nbClass2, devtest_set), nltk.classify.accuracy(nbClass2, test_set)], 
             ['Final', nltk.classify.accuracy(nbClass3, devtest_set), nltk.classify.accuracy(nbClass3, test_set)]],
            columns = ['MODEL', 'DEV_ACCURACY', 'TEST_ACCURACY'])

|  | MODEL | DEV_ACCURACY | TEST_ACCURACY |
|---|---|---|---|
| 0 | First | 0.782 | 0.772 |
| 1 | Second | 0.82 | 0.8 |
| 2 | Final | 0.81 | 0.802 |


We can see that accuracy on both the development and test sets is higher for the final model than for the first. The gains are not monotonic, though: between the second and final models, dev-test accuracy dipped from 0.82 to 0.81 while test accuracy was essentially unchanged (0.800 to 0.802). For each model, accuracy on the test set is also slightly lower than on the dev-test set. This is expected, as we tweaked our feature set based on the results of the dev-test set, whereas the test set played no part in those decisions.
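To judge how much weight these differences can bear, it helps to estimate the sampling noise of an accuracy measured on only 500 names. The sketch below uses the binomial standard error, which is an approximation we are introducing here and not part of the original analysis. It treats the dev-test and test sets as independent samples.

```python
import math

def accuracy_standard_error(accuracy, n):
    """Binomial standard error of an accuracy estimate measured on n examples."""
    return math.sqrt(accuracy * (1 - accuracy) / n)

# Final model, values from the table above; both sets hold 500 names.
dev_acc, test_acc, n = 0.81, 0.802, 500
se_dev = accuracy_standard_error(dev_acc, n)
se_test = accuracy_standard_error(test_acc, n)
# Standard error of the difference, treating the two sets as independent samples.
se_diff = math.sqrt(se_dev ** 2 + se_test ** 2)
gap = dev_acc - test_acc
print(f'dev-test gap: {gap:.3f}, standard error of gap: {se_diff:.3f}')
```

Under this approximation the gap for the final model is smaller than one standard error, so a difference of this size is compatible with chance variation in which names landed in each set.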

In [14]:
dtPred, dtError = pred_calc(dtName, gender_features3,nbClass3)
tsPred, tsError = pred_calc(tsName, gender_features3,nbClass3)

In [15]:
def summ_table(allNames, tsPred):
    # tsPred holds (actual, guess, name) tuples; allNames is unused and kept
    # only so that existing calls keep working.
    perform = []
    for actual, guess, _name in tsPred:
        if actual == 'male' and guess == 'male':
            perform.append('correct male')
        elif actual == 'female' and guess == 'female':
            perform.append('correct female')
        elif actual == 'male' and guess == 'female':
            perform.append('incorrect male')
        else:
            perform.append('incorrect female')
    correct_male = perform.count('correct male')
    correct_female = perform.count('correct female')
    incorrect_female = perform.count('incorrect female')
    incorrect_male = perform.count('incorrect male')
    
    # share of each gender's names that were classified correctly / incorrectly
    n_female = correct_female + incorrect_female
    n_male = correct_male + incorrect_male
    performance_table_pct = pd.DataFrame(
        [['Females', "{:.0%}".format(correct_female / n_female), "{:.0%}".format(incorrect_female / n_female)],
         ['Males', "{:.0%}".format(correct_male / n_male), "{:.0%}".format(incorrect_male / n_male)]],
        columns = ['Gender', 'Percent Correct', 'Percent Incorrect'])
    
    return performance_table_pct

We then wanted to take a look at the relative accuracies with respect to the two genders. The table below breaks down the results of our final model. It shows that our model predicted female names with a greater accuracy than male names. 

In [16]:
performance_table_pct = summ_table(allNames, tsPred)
performance_table_pct

|  | Gender | Percent Correct | Percent Incorrect |
|---|---|---|---|
| 0 | Females | 83% | 17% |
| 1 | Males | 76% | 24% |


Seeing the better accuracy for female names, we remembered that the dataset is skewed fairly heavily in favor of female names (63% female / 37% male). To see how much of the greater accuracy for female names was driven by this imbalance, we tried two basic approaches to dealing with an imbalanced dataset: undersampling and oversampling. We re-evaluated our model after balancing the training data, once by removing the extra female names and once by appending repeated male names until the counts matched. 
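As a general illustration of the two approaches, here is a minimal sketch of random undersampling and oversampling that returns a new list and leaves its input untouched. It is a randomized variant written for this explanation: the function name balance_by_label, the seed argument, and the toy data are assumptions, and it is not the function we used for the results in this report.

```python
import random

def balance_by_label(examples, mode='under', seed=0):
    """Return a new, label-balanced list of (item, label) pairs.

    mode='under' randomly drops examples from the larger classes;
    mode='over' randomly duplicates examples from the smaller classes.
    The input list is not modified.
    """
    rng = random.Random(seed)
    by_label = {}
    for example in examples:
        by_label.setdefault(example[1], []).append(example)
    sizes = [len(group) for group in by_label.values()]
    target = min(sizes) if mode == 'under' else max(sizes)
    balanced = []
    for group in by_label.values():
        if len(group) >= target:
            # Keep a random subset (or all of the group when it is already at target).
            balanced.extend(rng.sample(group, target))
        else:
            # Top up the smaller group with randomly chosen duplicates.
            extra = rng.choices(group, k=target - len(group))
            balanced.extend(group + extra)
    rng.shuffle(balanced)
    return balanced

# Made-up toy data: 5 female and 3 male examples.
toy = [('Ann', 'female')] * 5 + [('Bob', 'male')] * 3
under = balance_by_label(toy, 'under')
over = balance_by_label(toy, 'over')
print(len(toy), len(under), len(over))
```

Returning a new list means the same training data can be reused for several experiments without being reset in between.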

#### Effect of undersampling the female set within the training data

We wrote a function that takes both the training dataset [train_set] and the list of names associated with the training dataset [tName], determines which gender has more names, and either undersamples the larger group or oversamples the smaller one (based on input from the user). Note that the function modifies both lists in place. We then applied this function to the two lists before running our model again. 

In [17]:
# the 'under' input below selects undersampling (1, the default) or oversampling (0); train_set and tName are modified in place
def balance_train(train_set, tName, under = 1):
    gender = []
    for name,g in tName:
        if g == "female":
            gender.append(1)
        else:
            gender.append(0)
    n_female = sum(gender)
    n_male = len(gender) - n_female
    if n_female == n_male:
        return train_set, tName
    elif n_female > n_male:
        more = "F"
        delta = n_female - n_male
    else:
        more = "M"
        delta = n_male - n_female
    
    idx_males = []
    idx_males = [i for i, val in enumerate(tName) if val[1] == "male"]
    idx_females = []
    idx_females = [i for i, val in enumerate(tName) if val[1] == "female"]
    
    remove = []
    copy = []
    if more == "F":
        remove = idx_females
        remove = remove[-delta:]
        copy = idx_males
    elif more == "M":
        remove = idx_males
        remove = remove[-delta:]
        copy = idx_females
    
    if under == 1:
        for index in reversed(remove):
            del tName[index]
            del train_set[index]
    elif under == 0:
        for i in range(0,delta):
            tName.append(tName[copy[i]])
            train_set.append(train_set[copy[i]])
    return train_set, tName

In [18]:
test_set, devtest_set, train_set, tsName, dtName, tName = tts(gender_features3, allNames)

train_set, tName = balance_train(train_set, tName,1)

nbClass4 = nltk.NaiveBayesClassifier.train(train_set)
print('Accuracy: ', nltk.classify.accuracy(nbClass4, devtest_set))

preds4, errors4 = pred_calc(dtName, gender_features3, nbClass4)
print('Number of errors:', len(errors4))

Accuracy:  0.796
Number of errors: 102


In [19]:
tsPred4, tsError4 = pred_calc(tsName, gender_features3,nbClass4)
performance_table_pct = summ_table(allNames, tsPred4)
performance_table_pct

|  | Gender | Percent Correct | Percent Incorrect |
|---|---|---|---|
| 0 | Females | 78% | 22% |
| 1 | Males | 81% | 19% |


#### Effect of oversampling the male set within the training data
For the oversampling of male data we used the same function with the under input set to 0. In this mode it works out how many fewer male names there are and appends that many copies of male names already in the lists to even the numbers out. We again applied this function and re-ran the model. 

In [20]:
test_set, devtest_set, train_set, tsName, dtName, tName = tts(gender_features3, allNames)

train_set, tName = balance_train(train_set, tName, 0)

nbClass5 = nltk.NaiveBayesClassifier.train(train_set)
print('Accuracy: ', nltk.classify.accuracy(nbClass5, devtest_set))

preds5, errors5 = pred_calc(dtName, gender_features3, nbClass5)
print('Number of errors:', len(errors5))

Accuracy:  0.798
Number of errors: 101


In [21]:
tsPred5, tsError5 = pred_calc(tsName, gender_features3,nbClass5)
summ_table(allNames, tsPred5)

|  | Gender | Percent Correct | Percent Incorrect |
|---|---|---|---|
| 0 | Females | 78% | 22% |
| 1 | Males | 82% | 18% |


As we can see from the summary tables, both approaches to balancing the training dataset reduced the accuracy of predicting female names but increased that of predicting male names. With undersampling, the algorithm has fewer female names to determine patterns from. In both cases, the share of female names in the training data drops from roughly 63% to 50%, so the classifier's prior no longer favors the female label and borderline names are more likely to be classified as male. 
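One way to quantify the effect of balancing is to look at the label prior alone. A Naive Bayes classifier estimates the prior from label frequencies in the training data, so the sketch below compares the prior log-odds before and after balancing. It uses the full-corpus counts as a stand-in for the training split and ignores NLTK's smoothing, so the numbers are approximate.

```python
import math

def prior_log_odds(n_female, n_male):
    """Log-odds in favor of 'female' contributed by the label prior alone."""
    return math.log(n_female / n_male)

# Full-corpus counts from the data section; the training split has a similar ratio.
original = prior_log_odds(5001, 2943)
# After balancing, both labels are equally frequent.
balanced = prior_log_odds(1, 1)
print(f'original prior log-odds: {original:.3f}, balanced: {balanced:.3f}')
```

A name whose features favor the male label by less than the original prior log-odds would be labeled female by the unbalanced model but male by a balanced one, which is consistent with the shift seen in the summary tables.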

#### Impact of original random_seed to split the data

The names that make their way into the training set will obviously have an impact on how accurate a predictor model is. Below are a few result tables showing the difference in accuracy based on different initial train-test splits.

In [22]:
def pull_correct(allNames, tsPred):
    # tsPred holds (actual, guess, name) tuples; allNames is unused
    perform = []
    for actual, guess, _name in tsPred:
        if actual == 'male' and guess == 'male':
            perform.append('correct male')
        elif actual == 'female' and guess == 'female':
            perform.append('correct female')
        elif actual == 'male' and guess == 'female':
            perform.append('incorrect male')
        else:
            perform.append('incorrect female')
            
    correct_male = perform.count('correct male')
    correct_female = perform.count('correct female')
    incorrect_female = perform.count('incorrect female')
    incorrect_male = perform.count('incorrect male')
    
    female_pct_corr = correct_female / (correct_female + incorrect_female)
    male_pct_corr = correct_male / (correct_male + incorrect_male)
    
    return female_pct_corr , male_pct_corr

In [24]:
n_seeds = 5
seeds = random.sample(range(0,1000),n_seeds)
iterables = [seeds,['F','M']]
index = pd.MultiIndex.from_product(iterables, names = ['Seed','Gender'])
df = pd.DataFrame(np.zeros((n_seeds*2, 3)),index =index, columns = ["Normal","Undersampled","Oversampled"])
counter = 0
for seed in seeds:
    random.seed(seed)
    allNames = males + females
    random.shuffle(allNames)
    test_set, devtest_set, train_set, tsName, dtName, tName = tts(gender_features3, allNames)
    for i in range(0,3):
        train_set2 = []
        if i == 0:
            train_set2 = train_set
        elif i == 1:
            train_set2, tName = balance_train(train_set, tName, 1)
        elif i == 2:
            # NOTE: balance_train modifies train_set and tName in place, so by this
            # point both are already balanced and this call returns them unchanged.
            train_set2, tName = balance_train(train_set, tName, 0)
            
        nbClass = nltk.NaiveBayesClassifier.train(train_set2)
        tsPred, tsError = pred_calc(tsName, gender_features3,nbClass)
        f_pct, m_pct = pull_correct(allNames, tsPred)
        # write with a single .iloc[row, col] call; chained indexing may assign to a copy
        df.iloc[counter, i] = f_pct
        df.iloc[counter+1, i] = m_pct
    allNames =[]
    counter = counter + 2
    
df.style.format({
    'Normal': '{:,.1%}'.format,
    'Undersampled': '{:,.1%}'.format,
    'Oversampled': '{:,.1%}'.format,
})

| Seed | Gender | Normal | Undersampled | Oversampled |
|---|---|---|---|---|
| 582 | F | 85.2% | 81.0% | 81.0% |
| 582 | M | 78.8% | 82.5% | 82.5% |
| 257 | F | 81.5% | 78.8% | 78.8% |
| 257 | M | 73.1% | 80.3% | 80.3% |
| 886 | F | 79.7% | 74.3% | 74.3% |
| 886 | M | 77.0% | 80.5% | 80.5% |
| 983 | F | 80.5% | 74.9% | 74.9% |
| 983 | M | 76.4% | 80.7% | 80.7% |
| 198 | F | 80.7% | 77.0% | 77.0% |
| 198 | M | 75.0% | 79.9% | 79.9% |


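As we can see in the table above, our female name accuracy varies noticeably even for the Normal training set (from 79.7% to 85.2%) depending on the random seed used to split the data into train and test datasets. The Undersampled and Oversampled columns are identical for every seed because balance_train modifies its inputs in place: by the time the oversampling step runs inside the loop, the training data has already been balanced by undersampling, so the function returns it unchanged. The Oversampled column therefore repeats the undersampled result rather than providing an independent measurement. 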
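The spread across seeds can be summarized with a mean and standard deviation. The sketch below does this for the Normal column, using the values copied from the table above; the variable names are ours.

```python
from statistics import mean, stdev

# Accuracy (%) on the "Normal" training set per seed, copied from the table above.
normal_female = {582: 85.2, 257: 81.5, 886: 79.7, 983: 80.5, 198: 80.7}
normal_male = {582: 78.8, 257: 73.1, 886: 77.0, 983: 76.4, 198: 75.0}

for label, scores in [('F', normal_female), ('M', normal_male)]:
    values = list(scores.values())
    print(f'{label}: mean={mean(values):.1f}%, stdev={stdev(values):.1f}, '
          f'range={min(values):.1f}-{max(values):.1f}%')
```

With only five seeds these summaries are rough, but they give a sense of how much of a difference between two models could be due to the split alone.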

### Discussion



### Resources

1. D. M. Sidhu and P. M. Pexman. What’s in a name? Sound symbolism and gender in first names. PLOS ONE, 10(5):e0126809, 2015.
