## Project 3

This is a collaborative project conducted by the Fall 2017 students of DATA 620 at The City University of New York, in partial fulfillment of the requirements for the MS in Data Science degree.

### Problem Description

This is a Team Project! For this project, please work with the entire class as one collaborative group! Your project should be submitted (as an IPython Notebook via GitHub) by end of day on Monday, October 25th. The group should present their code and findings in our meet-up on Tuesday October 26th. The ability to be an effective member of a virtual team is highly valued in the data science job market.
Using any of the three classifiers described in chapter 6 of Natural Language Processing with Python, and any features you can think of, build the best name gender classifier you can. Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test set, and the remaining 6900 words for the training set. Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are satisfied with your classifier, check its final performance on the test set. How does the performance on the test set compare to the performance on the dev-test set? Is this what you'd expect?

Source: Natural Language Processing with Python, exercise 6.10.2.

### Contributors Include

* Joy Payton
* Keith Folsom
* Sonya Hong


### First, Obtain the Corpus

Note: If the names corpus is not already installed, nltk.download('names') will download it.

In [1]:
import nltk
from nltk.corpus import names
import random

nltk.download('names')

[nltk_data] Downloading package names to
[nltk_data]     C:\Users\Sonya\AppData\Roaming\nltk_data...
[nltk_data]   Package names is already up-to-date!


True

In [2]:
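# Note: this rebinds `names`, shadowing the corpus reader imported above,
# so re-running this cell on its own raises an AttributeError.
names = ([(name, 'male') for name in names.words('male.txt')] +
         [(name, 'female') for name in names.words('female.txt')])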

In [3]:
random.shuffle(names)

# let's see what the randomly shuffled names look like
names[1:10]

[(u'Joane', 'female'),
 (u'Ariella', 'female'),
 (u'Gigi', 'female'),
 (u'Taddeus', 'male'),
 (u'Adam', 'male'),
 (u'Alfredo', 'male'),
 (u'Cristi', 'female'),
 (u'Louisette', 'female'),
 (u'Brier', 'female')]
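
The shuffle above is unseeded, so the three subsets (and every accuracy reported later) change from run to run. A minimal sketch of a repeatable shuffle, assuming a fixed seed is acceptable for this project; the helper `seeded_shuffle`, the seed value 620, and the toy `sample_names` list are illustrative and not part of the notebook:

```python
import random

# Toy stand-in for the labeled names list (illustrative data only).
sample_names = [("Adam", "male"), ("Gigi", "female"), ("Alfredo", "male"),
                ("Cristi", "female"), ("Brier", "female")]

def seeded_shuffle(items, seed=620):
    """Return a shuffled copy of items; the same seed gives the same order."""
    shuffled = list(items)                  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # private generator, fixed seed
    return shuffled

first = seeded_shuffle(sample_names)
second = seeded_shuffle(sample_names)
print(first == second)   # True: identical order on every run
```

Using a private `random.Random` instance keeps the global generator untouched, so other cells that rely on `random` are not affected.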

### Create three subsets for development and error analysis of the models.

##### Development set:
* 6944 names (everything left after the two 500-name subsets) for the training set
* 500 names for the dev-test set  

##### Test set:
* 500 names for the testing set

In [4]:
test_names, devtest_names, train_names = names[0:500], names[500:1000], names[1000:]

In [5]:
# Confirm the size of the three subsets
print("Training Set = {}".format(len(train_names)))
print("Dev-Test Set = {}".format(len(devtest_names)))
print("Test Set = {}".format(len(test_names)))

Training Set = 6944
Dev-Test Set = 500
Test Set = 500
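
The slicing above can also be wrapped in a small function that checks that no name is lost or duplicated. This is only a sketch: `split_names` is a hypothetical helper, and the placeholder list stands in for the 7944 shuffled (name, gender) pairs.

```python
def split_names(labeled_names, n_test=500, n_devtest=500):
    """Slice an already-shuffled list into test, dev-test, and training subsets."""
    test = labeled_names[:n_test]
    devtest = labeled_names[n_test:n_test + n_devtest]
    train = labeled_names[n_test + n_devtest:]
    # Every item must land in exactly one subset.
    assert len(test) + len(devtest) + len(train) == len(labeled_names)
    return test, devtest, train

# Placeholder items (not real corpus data) with the same total size.
placeholder = [("name%d" % i, "female") for i in range(7944)]
test_part, devtest_part, train_part = split_names(placeholder)
print(len(train_part), len(devtest_part), len(test_part))  # 6944 500 500
```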


### Feature Extractor Functions

This section incrementally improves the feature extraction functions, which are then applied to the development and test datasets.

In [6]:
# book example
def gender_features(name):
    return {'last_letter': name[-1]}

# hypothesis: names beginning with a vowel are more often female
def gender_features2(name):
    return {'first_letter': name[0]}

# feature extractor from the book that overfits
def gender_features3(name):
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    return features

# from the book: one- and two-letter suffixes
def gender_features4(name):
    return {"suffix1": name[-1:], "suffix2": name[-2:]}
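
To make the difference between the extractors concrete, the sketch below repeats two of the definitions above and prints the feature dictionaries they produce for a single name. The name "Amber" is picked only as an illustration.

```python
def gender_features(name):
    # book example: last letter only
    return {'last_letter': name[-1]}

def gender_features4(name):
    # one- and two-letter suffixes
    return {"suffix1": name[-1:], "suffix2": name[-2:]}

sample = "Amber"
print(gender_features(sample))                   # {'last_letter': 'r'}
print(sorted(gender_features4(sample).items()))  # [('suffix1', 'r'), ('suffix2', 'er')]
```

The two-letter suffix lets a classifier treat "er" differently from other names ending in "r", which the single-letter feature cannot do.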

### Helper Functions

In [7]:
# Generic function to generate an error list based on the arguments provided
# Accepts the classifier, the names dataset, and the extractor function
# Returns the list of errors as (tag, guess, name) tuples

def generate_errors(classifier, dataset, extractor_function): 
    
    errors = [] 

    for (name, tag) in dataset:
        guess = classifier.classify(extractor_function(name)) 
        if guess != tag: 
            errors.append((tag, guess, name))
            
    return errors

In [8]:
# Generic function to display classification errors
# Accepts the error list and an optional n to show only the first n errors
# (the list is truncated to n items before it is sorted for display)

def show_errors(errors, n=None):
   
    if n is not None:
        errors = errors[:n]
            
    for (tag, guess, name) in sorted(errors): 
        print('correct=%-8s guess=%-8s name=%-30s' %(tag, guess, name))
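
A long error listing is hard to scan for patterns, so it can help to count errors by direction first. The sketch below assumes the same (tag, guess, name) tuple layout returned by generate_errors; `summarize_errors` is a hypothetical helper and the sample tuples are illustrative.

```python
from collections import Counter

def summarize_errors(errors):
    """Count misclassifications by (correct label, guessed label)."""
    return Counter((tag, guess) for (tag, guess, _name) in errors)

# Illustrative error tuples in the same (tag, guess, name) layout.
sample_errors = [("female", "male", "Amber"),
                 ("female", "male", "Allison"),
                 ("male", "female", "Aubrey")]

for (tag, guess), count in sorted(summarize_errors(sample_errors).items()):
    print("correct=%-8s guess=%-8s count=%d" % (tag, guess, count))
```

In the notebook this would be called as summarize_errors(generate_errors(classifier, devtest_names, gender_features)).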

### Gender Identification Models (Try Some Models)

### Naive Bayes
#### Gender Classification Model 1

In [9]:
# apply the first gender_features extractor function from the book to the three datasets
train_set = [(gender_features(n), g)  for (n, g) in train_names]
devtest_set = [(gender_features(n), g)  for (n, g) in devtest_names]
test_set = [(gender_features(n), g)  for (n, g) in test_names]
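
The same three list comprehensions are repeated for every model in this notebook. One possible way to avoid the repetition is a small helper; `featurize` is a hypothetical name, and the toy list below only stands in for train_names, devtest_names, or test_names.

```python
def gender_features(name):
    # book example: last letter only
    return {'last_letter': name[-1]}

def featurize(extractor, labeled_names):
    """Apply a feature extractor to every (name, gender) pair."""
    return [(extractor(n), g) for (n, g) in labeled_names]

toy_names = [("Adam", "male"), ("Gigi", "female")]
print(featurize(gender_features, toy_names))
# [({'last_letter': 'm'}, 'male'), ({'last_letter': 'i'}, 'female')]
```

With this helper, switching extractors means changing one argument, for example featurize(gender_features3, train_names).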

In [10]:
classifier = nltk.NaiveBayesClassifier.train(train_set) 

In [11]:
# Examine the likelihood ratios
classifier.show_most_informative_features(5)

Most Informative Features
             last_letter = u'a'           female : male   =     35.2 : 1.0
             last_letter = u'k'             male : female =     29.1 : 1.0
             last_letter = u'f'             male : female =     15.2 : 1.0
             last_letter = u'p'             male : female =     11.1 : 1.0
             last_letter = u'm'             male : female =      9.4 : 1.0
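
Each row above compares how likely a feature value is under one label versus the other. The sketch below is a simplified version of that ratio. It assumes add-0.5 (expected likelihood) smoothing of the per-label counts, which is an assumption here and not something the notebook states. The counts are made up for illustration and are not taken from the names corpus.

```python
def smoothed_prob(count, total, bins, gamma=0.5):
    """Smoothed estimate of P(feature value | label)."""
    return (count + gamma) / (total + gamma * bins)

def likelihood_ratio(count_a, total_a, count_b, total_b, bins):
    """Ratio of the larger conditional probability to the smaller one."""
    p_a = smoothed_prob(count_a, total_a, bins)
    p_b = smoothed_prob(count_b, total_b, bins)
    return max(p_a, p_b) / min(p_a, p_b)

# Made-up counts: 7 of 10 female names end in 'a', 1 of 10 male names do,
# with 26 possible last letters (bins).
ratio = likelihood_ratio(7, 10, 1, 10, bins=26)
print("female : male = %.1f : 1.0" % ratio)
```

A large ratio means that seeing the feature value shifts the prediction strongly toward one label, which is why last_letter = 'a' tops the list.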


In [12]:
print(nltk.classify.accuracy(classifier, devtest_set))

0.762
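
An accuracy figure is easier to judge against a baseline that ignores the name entirely and always predicts the most common training label. The sketch below uses a hypothetical helper, `majority_baseline`, on toy labels. In the notebook the label lists would come from train_names and devtest_names.

```python
from collections import Counter

def majority_baseline(train_labels, eval_labels):
    """Accuracy of always guessing the most frequent training label."""
    majority_label, _count = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in eval_labels if label == majority_label)
    return majority_label, hits / float(len(eval_labels))

# Toy labels for illustration only.
toy_train = ["female", "female", "female", "male", "male"]
toy_eval = ["female", "male", "female", "female"]
print(majority_baseline(toy_train, toy_eval))  # ('female', 0.75)
```

For the real data, a call such as majority_baseline([g for (_, g) in train_names], [g for (_, g) in devtest_names]) gives the number each model should beat.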


In [13]:
# display the classification errors
# calls the helper functions above
show_errors(generate_errors(classifier, devtest_names, gender_features))

correct=female   guess=male     name=Adel                          
correct=female   guess=male     name=Aleen                         
correct=female   guess=male     name=Alexis                        
correct=female   guess=male     name=Allison                       
correct=female   guess=male     name=Allyn                         
correct=female   guess=male     name=Amber                         
correct=female   guess=male     name=Babs                          
correct=female   guess=male     name=Caitlin                       
correct=female   guess=male     name=Caril                         
correct=female   guess=male     name=Charmion                      
correct=female   guess=male     name=Charo                         
correct=female   guess=male     name=Christen                      
correct=female   guess=male     name=Christian                     
correct=female   guess=male     name=Chrystal                      
correct=female   guess=male     name=Ciel       

#### Gender Classification Model 2

In [14]:
# apply the gender_features3 extractor function from the book to the three datasets

train_set = [(gender_features3(n), g)  for (n, g) in train_names]
devtest_set = [(gender_features3(n), g)  for (n, g) in devtest_names]
test_set = [(gender_features3(n), g)  for (n, g) in test_names]

classifier = nltk.NaiveBayesClassifier.train(train_set) 

## accuracy
print(nltk.classify.accuracy(classifier, devtest_set))

0.786


In [15]:
# Examine the likelihood ratios
classifier.show_most_informative_features(20)

Most Informative Features
              lastletter = u'a'           female : male   =     35.2 : 1.0
              lastletter = u'k'             male : female =     29.1 : 1.0
              lastletter = u'f'             male : female =     15.2 : 1.0
              lastletter = u'p'             male : female =     11.1 : 1.0
              lastletter = u'm'             male : female =      9.4 : 1.0
                count(v) = 2              female : male   =      9.3 : 1.0
              lastletter = u'd'             male : female =      9.1 : 1.0
              lastletter = u'o'             male : female =      9.0 : 1.0
              lastletter = u'v'             male : female =      8.4 : 1.0
              lastletter = u'g'             male : female =      7.6 : 1.0
              lastletter = u'r'             male : female =      6.7 : 1.0
              lastletter = u'w'             male : female =      5.1 : 1.0
             firstletter = u'w'             male : female =      4.7 : 1.0

In [16]:
show_errors(generate_errors(classifier, devtest_names, gender_features3), 30)

correct=female   guess=male     name=Amber                         
correct=female   guess=male     name=Ardyth                        
correct=female   guess=male     name=Betsy                         
correct=female   guess=male     name=Christen                      
correct=female   guess=male     name=Dorcas                        
correct=female   guess=male     name=Gerry                         
correct=female   guess=male     name=Guenevere                     
correct=female   guess=male     name=Gunvor                        
correct=female   guess=male     name=Honey                         
correct=female   guess=male     name=Margot                        
correct=female   guess=male     name=Meridith                      
correct=female   guess=male     name=Mikako                        
correct=female   guess=male     name=Odette                        
correct=female   guess=male     name=Tobie                         
correct=female   guess=male     name=Wendy      

#### Gender Classification Model 3

In [17]:
# apply the gender_features4 extractor function from the book to the three datasets

train_set = [(gender_features4(n), g)  for (n, g) in train_names]
devtest_set = [(gender_features4(n), g)  for (n, g) in devtest_names]
test_set = [(gender_features4(n), g)  for (n, g) in test_names]

classifier = nltk.NaiveBayesClassifier.train(train_set) 

## accuracy
print(nltk.classify.accuracy(classifier, devtest_set))

0.798


In [18]:
# Examine the likelihood ratios
classifier.show_most_informative_features(20)

Most Informative Features
                 suffix2 = u'na'          female : male   =     95.2 : 1.0
                 suffix2 = u'la'          female : male   =     72.8 : 1.0
                 suffix2 = u'ia'          female : male   =     37.9 : 1.0
                 suffix1 = u'a'           female : male   =     35.2 : 1.0
                 suffix2 = u'ra'          female : male   =     35.0 : 1.0
                 suffix2 = u'sa'          female : male   =     34.9 : 1.0
                 suffix1 = u'k'             male : female =     29.1 : 1.0
                 suffix2 = u'us'            male : female =     24.8 : 1.0
                 suffix2 = u'ta'          female : male   =     24.1 : 1.0
                 suffix2 = u'rd'            male : female =     24.1 : 1.0
                 suffix2 = u'rt'            male : female =     21.0 : 1.0
                 suffix2 = u'ld'            male : female =     19.5 : 1.0
                 suffix2 = u'os'            male : female =     18.2 : 1.0

In [19]:
show_errors(generate_errors(classifier, devtest_names, gender_features4), 30)

correct=female   guess=male     name=Adel                          
correct=female   guess=male     name=Allison                       
correct=female   guess=male     name=Amber                         
correct=female   guess=male     name=Ardyth                        
correct=female   guess=male     name=Caril                         
correct=female   guess=male     name=Charo                         
correct=female   guess=male     name=Christen                      
correct=female   guess=male     name=Dorcas                        
correct=female   guess=male     name=Gerry                         
correct=female   guess=male     name=Gunvor                        
correct=female   guess=male     name=Jaleh                         
correct=female   guess=male     name=Karel                         
correct=female   guess=male     name=Linell                        
correct=female   guess=male     name=Lisabeth                      
correct=female   guess=male     name=Margot     

### Decision Tree
#### Gender Classification Model 4

In [20]:
# apply the first gender_features extractor function from the book to the three datasets
train_set = [(gender_features(n), g)  for (n, g) in train_names]
devtest_set = [(gender_features(n), g)  for (n, g) in devtest_names]
test_set = [(gender_features(n), g)  for (n, g) in test_names]
classifier = nltk.DecisionTreeClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))

0.762


In [21]:
# display the classification errors
# calls the helper functions above
show_errors(generate_errors(classifier, devtest_names, gender_features))

correct=female   guess=male     name=Adel                          
correct=female   guess=male     name=Aleen                         
correct=female   guess=male     name=Alexis                        
correct=female   guess=male     name=Allison                       
correct=female   guess=male     name=Allyn                         
correct=female   guess=male     name=Amber                         
correct=female   guess=male     name=Babs                          
correct=female   guess=male     name=Caitlin                       
correct=female   guess=male     name=Caril                         
correct=female   guess=male     name=Charmion                      
correct=female   guess=male     name=Charo                         
correct=female   guess=male     name=Christen                      
correct=female   guess=male     name=Christian                     
correct=female   guess=male     name=Chrystal                      
correct=female   guess=male     name=Ciel       

#### Gender Classification Model 5

In [22]:
# apply the gender_features3 extractor function from the book to the three datasets

train_set = [(gender_features3(n), g)  for (n, g) in train_names]
devtest_set = [(gender_features3(n), g)  for (n, g) in devtest_names]
test_set = [(gender_features3(n), g)  for (n, g) in test_names]

classifier = nltk.DecisionTreeClassifier.train(train_set)

## accuracy
print(nltk.classify.accuracy(classifier, devtest_set))

0.818


In [23]:
show_errors(generate_errors(classifier, devtest_names, gender_features3), 30)

correct=female   guess=male     name=Adel                          
correct=female   guess=male     name=Allison                       
correct=female   guess=male     name=Blakelee                      
correct=female   guess=male     name=Gerry                         
correct=female   guess=male     name=Guenevere                     
correct=female   guess=male     name=Gunvor                        
correct=female   guess=male     name=Honey                         
correct=female   guess=male     name=Jaleh                         
correct=female   guess=male     name=Jilly                         
correct=female   guess=male     name=Marjorie                      
correct=female   guess=male     name=Meridith                      
correct=female   guess=male     name=Misty                         
correct=female   guess=male     name=Odette                        
correct=female   guess=male     name=Tally                         
correct=female   guess=male     name=Teriann    

#### Gender Classification Model 6

In [24]:
# apply the gender_features4 extractor function from the book to the three datasets

train_set = [(gender_features4(n), g)  for (n, g) in train_names]
devtest_set = [(gender_features4(n), g)  for (n, g) in devtest_names]
test_set = [(gender_features4(n), g)  for (n, g) in test_names]

classifier = nltk.DecisionTreeClassifier.train(train_set)

## accuracy
print(nltk.classify.accuracy(classifier, devtest_set))

0.806


In [25]:
show_errors(generate_errors(classifier, devtest_names, gender_features4), 30)

correct=female   guess=male     name=Allison                       
correct=female   guess=male     name=Amber                         
correct=female   guess=male     name=Caril                         
correct=female   guess=male     name=Charo                         
correct=female   guess=male     name=Chloe                         
correct=female   guess=male     name=Dorcas                        
correct=female   guess=male     name=Gunvor                        
correct=female   guess=male     name=Joice                         
correct=female   guess=male     name=Linell                        
correct=female   guess=male     name=Lust                          
correct=female   guess=male     name=Margot                        
correct=female   guess=male     name=Nance                         
correct=male     guess=female   name=Ambrose                       
correct=male     guess=female   name=Anatoly                       
correct=male     guess=female   name=Aubrey     

### Model Selection (Choose Best Candidate) 

#### Check the model's final performance on the test set. 
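
One way to run this check is to score the chosen model on the dev-test and test sets with the same function and compare the two numbers. The sketch below is hedged accordingly: `accuracy_on` is a hypothetical helper, `toy_classify` is a stand-in rule and not a trained model, and the toy lists do not represent the real subsets.

```python
def accuracy_on(classify, extractor, labeled_names):
    """Fraction of names whose predicted label matches the gold label."""
    correct = sum(1 for (name, tag) in labeled_names
                  if classify(extractor(name)) == tag)
    return correct / float(len(labeled_names))

def last_letter(name):
    return {"last_letter": name[-1]}

def toy_classify(features):
    # Stand-in rule, not a trained model: names ending in 'a' are guessed female.
    return "female" if features["last_letter"] == "a" else "male"

# Toy data for illustration only.
toy_devtest = [("Ariella", "female"), ("Adam", "male"),
               ("Amber", "female"), ("Alfredo", "male")]
toy_test = [("Louisette", "female"), ("Taddeus", "male")]

print("dev-test accuracy = %.2f" % accuracy_on(toy_classify, last_letter, toy_devtest))  # 0.75
print("test accuracy     = %.2f" % accuracy_on(toy_classify, last_letter, toy_test))     # 0.50
```

In the notebook, classifier.classify and the extractor used to train that classifier would be passed in, together with devtest_names and test_names. Because the features were tuned against the dev-test set, a somewhat lower score on the untouched test set would not be surprising.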

#### How does the performance on the test set compare to the performance on the dev-test set? 

#### Is this what you'd expect?