## Project 3

Haig Bedros, Nori Selina, Julia Ferris, Matthew Roland

Project 3 - Your project should be submitted (as a Jupyter Notebook via GitHub) by end of the due date. The group should present their code and findings in our meetup. The ability to be an effective member of a virtual team is highly valued in the data science job market.  

Using any of the three classifiers described in chapter 6 of Natural Language Processing with Python, and any features you can think of, build the best name gender classifier you can. 

Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test set, and the remaining 6900 words for the training set. Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are satisfied with your classifier, check its final performance on the test set.

How does the performance on the test set compare to the performance on the dev-test set?
Is this what you'd expect?
Source: Natural Language Processing with Python, exercise 6.10.2.

In [37]:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import names
from nltk.classify import apply_features
import random

## Defining a function to extract gender features

This function returns the last character of a word and its second-to-last character. For our model, we will use these suffix characters to classify whether a name is masculine or feminine.

In [30]:
def gender_features(word):
    return {'suffix1': word[-1],
            'suffix2': word[-2]}
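
To make the feature dictionary concrete, the sketch below repeats the definition above and prints the features for a single name. It also defines `gender_features_last2`, a hypothetical variant that is not part of this notebook: it assumes a two-character slice (`word[-2:]`) is wanted so that the last two letters are kept together. That variant was not used for any accuracy reported here.

```python
def gender_features(word):
    # Same definition as in the cell above: two single characters.
    return {'suffix1': word[-1],
            'suffix2': word[-2]}


def gender_features_last2(word):
    # Hypothetical variant: the slice word[-2:] keeps the last two characters together.
    return {'suffix1': word[-1:],
            'suffix2': word[-2:]}


print(gender_features('Shrek'))        # {'suffix1': 'k', 'suffix2': 'e'}
print(gender_features_last2('Shrek'))  # {'suffix1': 'k', 'suffix2': 'ek'}
```

The slice form also avoids an `IndexError` on one-letter strings, since slicing past the start of a string simply returns what is available.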

## Loading the Names Corpus

In [38]:
# nltk.download('names')  # uncomment on first run to fetch the Names Corpus
names = ([(name, 'male') for name in names.words('male.txt')] +
         [(name, 'female') for name in names.words('female.txt')])
random.seed(12345)
random.shuffle(names)

## Creating Train, Test, and Development Sets

In [52]:
featuresets = [(gender_features(n), g) for (n,g) in names]

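# Note: this trains on the first 501 names and evaluates on every name from index 1002 onward,
# so the "test" set here is much larger than the 500 names the exercise calls for.
train_set, test_set, dev_test = featuresets[0:501], featuresets[1002:], featuresets[501:1002]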

classifier = nltk.NaiveBayesClassifier.train(train_set)

print("Accuracy of test set:", nltk.classify.accuracy(classifier, test_set))
print('Accuracy of dev test set:', nltk.classify.accuracy(classifier, dev_test))

Accuracy of test set: 0.738692019590896
Accuracy of dev test set: 0.7385229540918163


As we can see, this classifier reaches an accuracy of about 0.74 on both sets, which is a reasonable baseline but can certainly be improved upon.
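
For reference, the split described in the exercise (500 names for test, 500 for dev-test, the remainder for training) could be sketched as follows. The helper `split_names` and its parameters are hypothetical names introduced for illustration, and the toy list stands in for the labeled names so the block runs without the corpus. This is a minimal sketch, not the split used for the numbers above.

```python
import random


def split_names(labeled_names, n_test=500, n_devtest=500, seed=12345):
    """Shuffle a copy of the data and return (train, devtest, test) lists."""
    shuffled = list(labeled_names)
    random.Random(seed).shuffle(shuffled)  # local RNG, leaves global state alone
    test = shuffled[:n_test]
    devtest = shuffled[n_test:n_test + n_devtest]
    train = shuffled[n_test + n_devtest:]
    return train, devtest, test


# Synthetic stand-in for the (name, gender) pairs built from the Names Corpus.
toy = [('name{}'.format(i), 'male' if i % 2 else 'female') for i in range(1200)]
train, devtest, test = split_names(toy)
print(len(train), len(devtest), len(test))  # 200 500 500
```

Slicing the shuffled names before extracting features keeps each name paired with its feature set, which makes later error analysis on the dev-test names straightforward.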

In [51]:
dev_test_names = names[501:1002]

errors = []
for (name, tag) in dev_test_names:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append((tag, guess, name))

for (tag, guess, name) in sorted(errors):
    print('correct={:<8} guess={:<8} name={:<30}'.format(tag, guess, name))
    

correct=female   guess=male     name=Aidan                         
correct=female   guess=male     name=Aileen                        
correct=female   guess=male     name=Alisun                        
correct=female   guess=male     name=Astrix                        
correct=female   guess=male     name=Bab                           
correct=female   guess=male     name=Bette-Ann                     
correct=female   guess=male     name=Bev                           
correct=female   guess=male     name=Birgit                        
correct=female   guess=male     name=Brigit                        
correct=female   guess=male     name=Cal                           
correct=female   guess=male     name=Carilyn                       
correct=female   guess=male     name=Carmon                        
correct=female   guess=male     name=Chad                          
correct=female   guess=male     name=Christean                     
correct=female   guess=male     name=Cloe       

The list of errors can give us hints about how to modify our model. For instance, in the errors shown above, the model labels female names as male based on their final two characters. Incorporating the first character of the name may ameliorate this issue.
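
One way to turn the error list into something more systematic is to tally the errors by direction and by final letter. The helper `error_breakdown` below is a hypothetical name, and the sample rows are copied from the output above; the sketch assumes `errors` holds `(tag, guess, name)` tuples as in the previous cell.

```python
from collections import Counter


def error_breakdown(errors):
    """Count errors by (correct, guess) pair and by the name's final letter."""
    by_direction = Counter((tag, guess) for tag, guess, _ in errors)
    by_last_letter = Counter(name[-1].lower() for _, _, name in errors)
    return by_direction, by_last_letter


# A few rows taken from the error listing above.
sample_errors = [('female', 'male', 'Aidan'),
                 ('female', 'male', 'Aileen'),
                 ('female', 'male', 'Bev'),
                 ('female', 'male', 'Cal')]

by_direction, by_last_letter = error_breakdown(sample_errors)
print(by_direction.most_common())
print(by_last_letter.most_common(1))  # [('n', 2)]
```

Run on the full `errors` list, the two counters show whether mistakes cluster in one direction or around particular endings.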

## Refining the Classification Model

In [70]:
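def gender_features_2(word):
    return {'suffix1': word[-1],   # last character
            'suffix2': word[-2],   # second-to-last character
            'prefix1': word[0],    # first character
            'prefix2': word[1],    # second character
            'length': len(word),
            'vowels': sum(1 for v in word.lower() if v in 'aeiou'),
            # Counts every non-vowel character, including punctuation such as hyphens
            'consonants': sum(1 for c in word.lower() if c not in 'aeiou'),
            # Despite the key name, this tests the first character; it is case-sensitive,
            # so a capitalized name never matches the lowercase vowels
            'suffix_vowel': word[0] in 'aeiou'}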
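
As a possible refinement of the vowel-related features, the sketch below lowercases the name before testing for vowels and counts only alphabetic characters as consonants. The function `vowel_features` and its key names are hypothetical, and treating `y` as a consonant is an assumption. These features were not used for the results reported in this notebook.

```python
VOWELS = set('aeiou')  # assumption: 'y' is treated as a consonant


def vowel_features(word):
    lowered = word.lower()
    # Keep letters only, so hyphens and other punctuation are not counted.
    letters = [ch for ch in lowered if ch.isalpha()]
    n_vowels = sum(1 for ch in letters if ch in VOWELS)
    return {'starts_with_vowel': lowered[0] in VOWELS,
            'ends_with_vowel': lowered[-1] in VOWELS,
            'vowels': n_vowels,
            'consonants': len(letters) - n_vowels}


print(vowel_features('Aileen'))
print(vowel_features('Bette-Ann'))
```

Whether these features help would need to be checked on the dev-test set before looking at the test set.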

In [71]:
featuresets = [(gender_features_2(n), g) for (n,g) in names]

# Same index ranges as the first model, so both classifiers are evaluated on the same names
train_set_2, test_set_2, dev_test_2 = featuresets[0:501], featuresets[1002:], featuresets[501:1002]

classifier_2 = nltk.NaiveBayesClassifier.train(train_set_2)

print("Accuracy of test set:", nltk.classify.accuracy(classifier_2, test_set_2))
print('Accuracy of dev test set:', nltk.classify.accuracy(classifier_2, dev_test_2))

Accuracy of test set: 0.754393546528378
Accuracy of dev test set: 0.7644710578842315
