### CUNY Data 620 - Web Analytics, Summer 2020  
**Group Project 3**   
**Prof:** Alain Ledon  
**Members:** Misha Kollontai, Amber Ferger, Zach Alexander, Subhalaxmi Rout  
  
**YouTube Link**: 

### Instructions
Using any of the three classifiers described in chapter 6 of Natural Language Processing with Python,
and any features you can think of, build the best name gender classifier you can. 

Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the devtest set, and the remaining 6900 words for the training set. Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are satisfied with your classifier, check its final performance on the test set.


How does the performance on the test set compare to the performance on the dev-test set? Is this what
you'd expect? 

### Importing Packages

In [1]:
import nltk
from nltk.corpus import names
import random
import pandas as pd
import numpy as np

### The Data

The *names* corpus in the nltk package contains 7,944 first names, each labeled as male or female. First, we will compile a list of all names with their gender. 

In [2]:
males = [(name, 'male') for name in names.words('male.txt')]
numMales = len(males)
females = [(name, 'female') for name in names.words('female.txt')]
numFemales = len(females)

print(f'There are {numMales} male names in the dataset.')
print(f'There are {numFemales} female names in the dataset.')

There are 2943 male names in the dataset.
There are 5001 female names in the dataset.


We can combine the lists and shuffle the data so that all names of the same gender are not together. We can confirm that the names are shuffled by looking at the genders of the first 5 individuals. 

In [3]:
random.seed(123)
allNames = males + females
random.shuffle(allNames)

print('First 5 names in the dataset:')
allNames[0:5]

First 5 names in the dataset:


[('Cordelie', 'female'),
 ('Peggie', 'female'),
 ('Solange', 'female'),
 ('Rana', 'female'),
 ('Jessy', 'female')]

### The Features
Next, we'll define a function to create features for our names. The initial features will include:
* **last_letter**: The last letter of the given name.
* **first_letter**: The first letter of the given name. 
* **name_length**: The length of the given name.
* **num_vowels**: The number of vowels in the given name.
* **num_consonants**: The number of consonants in the given name. 

In [4]:
def gender_features(name):
    name = name.lower()
    features = {}
    features['last_letter'] = name[-1]
    features['first_letter'] = name[0]
    features['name_length'] = len(name)  
    vowels = ['a', 'e', 'i', 'o', 'u']
    vowelLength = len([i for i in name if i in vowels])
    features['num_vowels'] = vowelLength
    features['num_consonants'] = len(name) - vowelLength

    return features

### Train-Test-Split
Now that we've defined our feature function, we can run it on our dataset and split it into training, testing, and dev testing sets. 
* **Training Set**: This data will be used to train our classifiers and fit the models.
* **Dev Test Set**: This data will be used to check each model's predictions while we iterate. Because the model is not trained on it, it gives an evaluation that is independent of the training data, and we can use the results to tune our features. 
* **Test Set**: This data will be used to compute the accuracy of the final model. Since it plays no part in training or tuning, it will provide an unbiased evaluation of the classifier.

The splits will be in the format of ({features}, gender). We will store the names and genders of the individuals in separate lists for each split.

In [5]:
def tts(featureFunc, nameList):
    featureSet = [(featureFunc(n),g) for (n,g) in nameList]
    test_set, devtest_set, train_set = featureSet[0:500], featureSet[500:1000], featureSet[1000:] 
    tsName = nameList[0:500]
    dtName = nameList[500:1000]
    tName = nameList[1000:]
    
    return test_set, devtest_set, train_set, tsName, dtName, tName

test_set, devtest_set, train_set, tsName, dtName, tName = tts(gender_features, allNames)

print('Num records - train set: ', len(train_set))
print('Num records - dev test set: ', len(devtest_set))
print('Num records - test set: ', len(test_set))

Num records - train set:  6944
Num records - dev test set:  500
Num records - test set:  500


### Original Classifier - Naive Bayes Classifier
Now that we've split our data into training, development, and test sets, we can create a **Naive Bayes Classifier** to predict the gender of the names. In this type of model, each feature gets a say in determining which label should be assigned to a given input value. The prior probability is calculated for each label (male, female), and the contribution from each feature is combined with this probability to arrive at a likelihood estimate for each label.
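
To make the mechanics concrete, here is a minimal sketch of that calculation on a toy dataset. It is not NLTK's implementation (NLTK uses its own probability estimator); the tiny training list, the add-one smoothing, and the helper name `toy_naive_bayes` are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
import math

# Hypothetical toy training data in the same ({features}, label) format.
toy_train = [
    ({'last_letter': 'a'}, 'female'),
    ({'last_letter': 'a'}, 'female'),
    ({'last_letter': 'e'}, 'female'),
    ({'last_letter': 'k'}, 'male'),
    ({'last_letter': 'e'}, 'male'),
]

def toy_naive_bayes(train, featureset):
    """Return P(label | features) using add-one smoothing (an assumption)."""
    label_counts = Counter(label for _, label in train)
    # value_counts[(label, feature_name)][value] = number of occurrences
    value_counts = defaultdict(Counter)
    values_seen = defaultdict(set)
    for feats, label in train:
        for fname, fval in feats.items():
            value_counts[(label, fname)][fval] += 1
            values_seen[fname].add(fval)

    log_scores = {}
    for label, n_label in label_counts.items():
        # Start from the prior probability of the label.
        score = math.log(n_label / len(train))
        # Each feature then contributes its smoothed likelihood.
        for fname, fval in featureset.items():
            count = value_counts[(label, fname)][fval]
            n_values = len(values_seen[fname])
            score += math.log((count + 1) / (n_label + n_values))
        log_scores[label] = score

    # Normalise so the label probabilities sum to 1.
    total = sum(math.exp(s) for s in log_scores.values())
    return {label: math.exp(s) / total for label, s in log_scores.items()}

probs = toy_naive_bayes(toy_train, {'last_letter': 'a'})
print(probs)
```

In this toy data the label *female* has both the larger prior (3 of 5 names) and the larger likelihood for a final *a*, so it receives the higher probability.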

We will measure the accuracy of the model (the percentage of names the classifier predicts correctly) using the development test set.

In [6]:
nbClass = nltk.NaiveBayesClassifier.train(train_set)
print('Accuracy: ', nltk.classify.accuracy(nbClass, devtest_set))

Accuracy:  0.782


We can also take a look at the most important features used for predicting the gender. For each feature value, this shows how much more likely it is to occur with one gender than with the other.

In [7]:
nbClass.show_most_informative_features(15)

Most Informative Features
             last_letter = 'a'            female : male   =     33.3 : 1.0
             last_letter = 'k'              male : female =     29.2 : 1.0
             last_letter = 'p'              male : female =     18.6 : 1.0
             last_letter = 'f'              male : female =     15.2 : 1.0
             last_letter = 'v'              male : female =      9.8 : 1.0
             last_letter = 'd'              male : female =      9.8 : 1.0
             last_letter = 'm'              male : female =      9.2 : 1.0
             last_letter = 'o'              male : female =      8.0 : 1.0
             last_letter = 'w'              male : female =      8.0 : 1.0
             last_letter = 'r'              male : female =      6.7 : 1.0
            first_letter = 'w'              male : female =      4.6 : 1.0
              num_vowels = 5              female : male   =      4.5 : 1.0
             last_letter = 'b'              male : female =      4.4 : 1.0

We can see that the last letter and number of vowels in the names appear to be the driving factors. 
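
The ratios in the listing compare how likely a feature value is under one label versus the other. The sketch below reproduces that idea from raw counts. The counts are made up for illustration, the helper name `likelihood_ratio` is hypothetical, and NLTK's own figures will differ somewhat because it smooths its probability estimates.

```python
def likelihood_ratio(count_a, total_a, count_b, total_b):
    """Ratio of P(value | label A) to P(value | label B) from raw counts."""
    p_a = count_a / total_a
    p_b = count_b / total_b
    return p_a / p_b

# Hypothetical counts: names ending in 'a' among 4,000 female and 2,000 male names.
ratio = likelihood_ratio(count_a=1500, total_a=4000, count_b=25, total_b=2000)
print(f"female : male = {ratio:.1f} : 1.0")
```

Because each count is divided by its own class total, the ratio is not inflated by the corpus containing more female than male names.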

We can also generate a list of errors to see which names we've classified improperly. This will help us identify what additional features we should add to make the classification more accurate. 

In [40]:
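def pred_calc(nameList, featureFunc):
    """Return (actual, guess, name) tuples for every name, plus the misclassified subset."""
    # Note: predictions come from the global nbClass (the first classifier).
    preds = []
    errors = []
    for (name,actual) in nameList:
        guess = nbClass.classify(featureFunc(name))
        preds.append((actual,guess,name))
        if guess != actual:
            errors.append((actual, guess, name))
    
    return preds, errors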

preds, errors = pred_calc(dtName, gender_features)
print('Number of errors:', len(errors))

Number of errors: 109


When we sort the errors by the last two characters of the first name, we can see that some combinations occur more frequently in males than females and vice versa. For example, the letters *ie* appear more often in male names and the letters *ly* appear more often in female names. Let's update our feature set to take this into account.

In [38]:
sorted(errors, key=lambda x: x[-1][-2:])

[('female', 'male', 'Em'),
 ('female', 'male', 'Talyah'),
 ('female', 'male', 'Shirah'),
 ('male', 'female', 'Donal'),
 ('female', 'male', 'Sam'),
 ('male', 'female', 'Fabian'),
 ('female', 'male', 'Sean'),
 ('male', 'female', 'Coleman'),
 ('male', 'female', 'Christian'),
 ('male', 'female', 'Adrian'),
 ('male', 'female', 'Vaughan'),
 ('female', 'male', 'Meggan'),
 ('female', 'male', 'Gay'),
 ('male', 'female', 'Murray'),
 ('male', 'female', 'Lawrence'),
 ('male', 'female', 'Bruce'),
 ('male', 'female', 'Lawerence'),
 ('male', 'female', 'Erich'),
 ('female', 'male', 'Dulcy'),
 ('male', 'female', 'Randi'),
 ('male', 'female', 'Lindy'),
 ('female', 'male', 'Freddy'),
 ('male', 'female', 'Jessee'),
 ('male', 'female', 'Mikel'),
 ('male', 'female', 'Nathaniel'),
 ('female', 'male', 'Pen'),
 ('female', 'male', 'Gwen'),
 ('female', 'male', 'Grier'),
 ('female', 'male', 'Delores'),
 ('female', 'male', 'Dew'),
 ('female', 'male', 'Sukey'),
 ('male', 'female', 'Carey'),
 ('female', 'male', 'Sop
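
Scanning a long sorted list by eye is error-prone, so one option is to tally the errors by suffix and by actual gender. The sketch below does this for a handful of the misclassified names shown above. The `sample_errors` list and the `suffix_counts` name are illustrative; in the notebook the full `errors` list would be used instead.

```python
from collections import Counter

# A few (actual, guess, name) tuples copied from the error listing above.
sample_errors = [
    ('male', 'female', 'Fabian'),
    ('male', 'female', 'Christian'),
    ('male', 'female', 'Adrian'),
    ('female', 'male', 'Meggan'),
    ('female', 'male', 'Pen'),
    ('female', 'male', 'Gwen'),
    ('male', 'female', 'Murray'),
]

# Count errors by (last two letters, actual gender).
suffix_counts = Counter(
    (name[-2:].lower(), actual) for actual, guess, name in sample_errors
)

for (suffix, actual), n in suffix_counts.most_common():
    print(f"-{suffix}: {n} {actual} name(s) misclassified")
```

Suffixes that are misclassified mostly in one direction are natural candidates for a new feature.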

### Feature Set Revamp

**Last two letters**: First, let's add in a feature for the last 2 letters of each name. We'll recreate our train, test, and dev test splits and run the Naive Bayes Classifier on the data.

In [11]:
def gender_features2(name):
    name = name.lower()
    features = {}
    features['last_letter'] = name[-1]
    features['first_letter'] = name[0]
    features['name_length'] = len(name)    
    vowels = ['a', 'e', 'i', 'o', 'u']
    vowelLength = len([i for i in name if i in vowels])
    features['num_vowels'] = vowelLength
    features['num_consonants'] = len(name) - vowelLength
    
    # add in feature for last 2 letters of name
    features['last_two_letters'] = name[-2:]

    return features

test_set, devtest_set, train_set, tsName, dtName, tName = tts(gender_features2, allNames)
nbClass2 = nltk.NaiveBayesClassifier.train(train_set)
print('Accuracy: ', nltk.classify.accuracy(nbClass2, devtest_set))

Accuracy:  0.79


Our accuracy went up from 78.2% to 79%! Let's try again with some additional features.

**Bouba and Kiki Vowels/Consonants**: Sidhu and Pexman (1) reported that female first names tend to contain round-sounding ("Bouba") phonemes, while male first names tend to contain sharp-sounding ("Kiki") phonemes. We will use a modified version of their findings and define the following new features: 
* **num_bouba_cons**: Count of the letters *b*, *l*, *m*, and *n*. *(Female names tend to have more of these)*
* **num_bouba_vowels**: Count of the letters *u* and *o*. *(Female names tend to have more of these)*
* **num_kiki_cons**: Count of the letters *k*, *p*, and *t*. *(Male names tend to have more of these)*
* **num_kiki_vowels**: Count of the letters *i* and *e*. *(Male names tend to have more of these)*

In [12]:
# https://arxiv.org/pdf/1606.05467.pdf

def gender_features3(name):
    name = name.lower()
    features = {}
    features['last_letter'] = name[-1]
    features['first_letter'] = name[0]
    features['name_length'] = len(name)    
    vowels = ['a', 'e', 'i', 'o', 'u']
    vowelLength = len([i for i in name if i in vowels])
    features['num_vowels'] = vowelLength
    features['num_consonants'] = len(name) - vowelLength
    
    # add in feature for last 2 letters of name
    features['last_two_letters'] = name[-2:]
    
    # add in bouba & kiki counts
    boubaCons = ['b', 'l', 'm', 'n']
    boubaVowels = ['u', 'o']
    kikiCons = ['k', 'p', 't']
    kikiVowels = ['i', 'e']
    
    bcLength = len([i for i in name if i in boubaCons])
    bvLength = len([i for i in name if i in boubaVowels])
    kcLength = len([i for i in name if i in kikiCons])
    kvLength = len([i for i in name if i in kikiVowels])

    features['num_bouba_cons'] = bcLength
    features['num_bouba_vowels'] = bvLength
    features['num_kiki_cons'] = kcLength
    features['num_kiki_vowels'] = kvLength

    return features

test_set, devtest_set, train_set, tsName, dtName, tName = tts(gender_features3, allNames)
nbClass3 = nltk.NaiveBayesClassifier.train(train_set)
print('Accuracy: ', nltk.classify.accuracy(nbClass3, devtest_set))

Accuracy:  0.794


### Evaluation
We can now evaluate the final model on our test set. First, we'll look at the overall accuracy of each of our subsequent models. 

In [23]:
pd.DataFrame([['First', nltk.classify.accuracy(nbClass, devtest_set), nltk.classify.accuracy(nbClass, test_set)], 
             ['Second', nltk.classify.accuracy(nbClass2, devtest_set), nltk.classify.accuracy(nbClass2, test_set)], 
             ['Final', nltk.classify.accuracy(nbClass3, devtest_set), nltk.classify.accuracy(nbClass3, test_set)]],
            columns = ['MODEL', 'DEV_ACCURACY', 'TEST_ACCURACY'])

| | MODEL | DEV_ACCURACY | TEST_ACCURACY |
|---|---|---|---|
| 0 | First | 0.782 | 0.772 |
| 1 | Second | 0.79 | 0.78 |
| 2 | Final | 0.794 | 0.782 |


Accuracy on both the development and test sets increases from the first model to the final model. For every model, accuracy on the test set is also about one percentage point lower than on the development set. This is expected: we tuned our feature set against the development set results, whereas the test set played no part in those decisions.
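
It is also worth asking how much of the gap between the two sets could be sampling noise, since each evaluation set has only 500 names. The following is a rough back-of-the-envelope check using the normal approximation to the binomial. The helper name `accuracy_interval` and the choice of a 95% interval are assumptions, not part of the original analysis.

```python
import math

def accuracy_interval(accuracy, n, z=1.96):
    """Approximate 95% confidence interval for an accuracy measured on n items."""
    se = math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - z * se, accuracy + z * se

# Final model accuracies reported in the table above, each measured on 500 names.
for label, acc in [('dev-test', 0.794), ('test', 0.782)]:
    low, high = accuracy_interval(acc, 500)
    print(f"{label}: {acc:.3f} (approx. 95% CI {low:.3f} to {high:.3f})")
```

Under these assumptions each interval is several percentage points wide, so a difference of about one point between the dev-test and test sets is within what chance alone could produce.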

In [43]:
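# Evaluate the final model: pred_calc is hard-wired to nbClass, so classify with nbClass3 here
dtPred = [(g, nbClass3.classify(gender_features3(n)), n) for (n, g) in dtName]
tsPred = [(g, nbClass3.classify(gender_features3(n)), n) for (n, g) in tsName]
dtError, tsError = [p for p in dtPred if p[0] != p[1]], [p for p in tsPred if p[0] != p[1]]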
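
The prediction lists can then be summarised, for example as a confusion table of actual versus predicted gender. This is a minimal sketch: `sample_preds` is a made-up stand-in for `dtPred` or `tsPred`, using the same `(actual, guess, name)` tuple format as the rest of this notebook.

```python
from collections import Counter

# Hypothetical (actual, guess, name) tuples; replace with dtPred or tsPred.
sample_preds = [
    ('female', 'female', 'Rana'),
    ('female', 'female', 'Peggie'),
    ('female', 'male', 'Gwen'),
    ('male', 'male', 'Mark'),
    ('male', 'female', 'Adrian'),
]

# Tally each (actual, predicted) pair.
confusion = Counter((actual, guess) for actual, guess, _ in sample_preds)

labels = ['female', 'male']
print('actual \\ predicted', *labels)
for actual in labels:
    print(actual, *[confusion[(actual, guess)] for guess in labels])
```

Comparing the two off-diagonal counts shows whether the classifier errs more often on male or on female names.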

### Discussion


### Resources

1. D. M. Sidhu and P. M. Pexman. What’s in a name? Sound symbolism and gender in first names. PLOS ONE, 10(5):e0126809, 2015.
